1 Linear (Gaussian) Models

Configuring R

Functions from these packages will be used throughout this document:

[R code]
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

[R code]
rm(list = ls()) # delete any data that's already loaded into R

ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 6)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default
legend_text_size = 9
run_graphs = TRUE
[R code]
include_reference_lines <- FALSE

Note

This content is adapted from:

  • Dobson and Barnett (2018), Chapters 2-6
  • Dunn and Smyth (2018), Chapters 2-3
  • Vittinghoff et al. (2012), Chapter 4

There are numerous textbooks specifically for linear regression, including:

  • Kutner et al. (2005): used for UCLA Biostatistics MS level linear models class
  • Chatterjee and Hadi (2015): used for Stanford MS-level linear models class
  • Seber and Lee (2012): used for UCLA Biostatistics PhD level linear models class and UC Davis STA 108.
  • Kleinbaum et al. (2014): same first author as Kleinbaum and Klein (2010) and Kleinbaum and Klein (2012)
  • Linear Models with R (Faraway 2025)
  • Applied Linear Regression by Sanford Weisberg (Weisberg 2005)

For more recommendations, see the discussion on Reddit.

1.1 Overview

Why this course includes linear regression

  • This course is about generalized linear models (for non-Gaussian outcomes)
  • UC Davis STA 108 (“Applied Statistical Methods: Regression Analysis”) is a prerequisite for this course, so everyone here should have some understanding of linear regression already.
  • We will review linear regression to:
  • make sure everyone is caught up, and
  • provide an epidemiological perspective on model interpretation.

Chapter overview

  • Section 1.2: how to interpret linear regression models

  • Section 1.3: how to estimate linear regression models

  • Section 1.4: how to quantify uncertainty about our estimates

  • Section 1.8: how to tell if your model is insufficiently complex

1.2 Understanding Gaussian Linear Regression Models

Motivating example: birthweights and gestational age

Suppose we want to learn about the distributions of birthweights (outcome \(Y\)) for (human) babies born at different gestational ages (covariate \(A\)) and with different chromosomal sexes (covariate \(S\)) (Dobson and Barnett (2018) Example 2.2.2).

Dobson birthweight data

[R code]
library(dobson)
data("birthweight", package = "dobson")
birthweight
Table 1: birthweight data (Dobson and Barnett (2018) Example 2.2.2)
[R code]
library(tidyverse)
bw <-
  birthweight |>
  pivot_longer(
    cols = everything(),
    names_to = c("sex", ".value"),
    names_sep = "s "
  ) |>
  rename(age = `gestational age`) |>
  mutate(
    id = row_number(),
    sex = sex |>
      case_match(
        "boy" ~ "male",
        "girl" ~ "female"
      ) |>
      factor(levels = c("female", "male")),
    male = sex == "male",
    female = sex == "female"
  )

bw
Table 2: birthweight data reshaped
[R code]
plot1 <- bw |>
  ggplot(aes(
    x = age,
    y = weight,
    shape = sex,
    col = sex
  )) +
  theme_bw() +
  xlab("Gestational age (weeks)") +
  ylab("Birthweight (grams)") +
  theme(legend.position = "bottom") +
  # expand_limits(y = 0, x = 0) +
  geom_point(alpha = .7)
print(plot1 + facet_wrap(~sex))
Figure 1: birthweight data (Dobson and Barnett (2018) Example 2.2.2)

Data notation

  • \(Y\): birthweight (measured in grams)
  • \(S\): chromosomal sex: “male” (XY) or “female” (XX)
  • \(M\): indicator variable for \(S\) = “male”
  • \(M = 0\) if \(S\) = “female”
  • \(M = 1\) if \(S\) = “male”
  • \(F\): indicator variable for \(S\) = “female”
  • \(F = 1\) if \(S\) = “female”
  • \(F = 0\) if \(S\) = “male”
  • \(A\): estimated gestational age at birth (measured in weeks).

Parallel lines regression

(c.f. Dunn and Smyth (2018) §2.10.3)

\[ \begin{aligned} Y|M,A &\ \sim_{\text{ciid}}\ N(\mu(M,A), \sigma^2)\\ \mu(m,a) &= \beta_0 + \beta_M m + \beta_A a \end{aligned} \tag{1}\]

[R code]
bw_lm1 <- lm(
  formula = weight ~ sex + age,
  data = bw
)

library(parameters)
bw_lm1 |>
  parameters::parameters() |>
  parameters::print_md(
    include_reference = include_reference_lines,
    select = "{estimate}"
  )
Table 3: Regression parameter estimates for Model 1 of birthweight data
Parameter Coefficient
(Intercept) -1773.32
sex (male) 163.04
age 120.89
[R code]
bw <-
  bw |>
  mutate(`E[Y|X=x]` = fitted(bw_lm1)) |>
  arrange(sex, age)

plot2 <-
  plot1 %+% bw +
  geom_line(aes(y = `E[Y|X=x]`))

print(plot2)
Figure 2: Graph of Model 1 for birthweight data

Model assumptions and predictions

Exercise 1 What’s the mean birthweight for a female born at 36 weeks?

Estimated coefficients for model 1
Parameter Coefficient
(Intercept) -1773.32
sex (male) 163.04
age 120.89

Solution.  

Estimated coefficients for model 1
Parameter Coefficient
(Intercept) -1773.32
sex (male) 163.04
age 120.89
[R code]
pred_female <- coef(bw_lm1)["(Intercept)"] + coef(bw_lm1)["age"] * 36
### or using built-in prediction:
pred_female_alt <- predict(bw_lm1, newdata = tibble(sex = "female", age = 36))

\[ \begin{aligned} E[Y|M = 0, A = 36] &= \beta_0 + {\left(\beta_M \cdot 0\right)}+ {\left(\beta_A \cdot 36\right)} \\ &= -1773.321839 + {\left(163.039303 \cdot 0\right)} + {\left(120.894327 \cdot 36\right)} \\ &= 2578.873934 \end{aligned} \]

Exercise 2 What’s the mean birthweight for a male born at 36 weeks?

Estimated coefficients for model 1
Parameter Coefficient
(Intercept) -1773.32
sex (male) 163.04
age 120.89

Solution.  

Estimated coefficients for model 1
Parameter Coefficient
(Intercept) -1773.32
sex (male) 163.04
age 120.89
[R code]
pred_male <-
  coef(bw_lm1)["(Intercept)"] +
  coef(bw_lm1)["sexmale"] +
  coef(bw_lm1)["age"] * 36

\[ \begin{aligned} E[Y|M = 1, A = 36] &= \beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36 \\ &= 2741.913237 \end{aligned} \]

Exercise 3 What’s the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?

[R code]
coef(bw_lm1)
#> (Intercept)     sexmale         age 
#>   -1773.322     163.039     120.894

Solution. \[ \begin{aligned} & E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\ &= 2741.913237 - 2578.873934\\ &= 163.039303 \end{aligned} \]

Shortcut:

\[ \begin{aligned} & E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\ &= (\beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36) - (\beta_0 + \beta_M \cdot 0+ \beta_A \cdot 36) \\ &= \beta_M \\ &= 163.039303 \end{aligned} \]
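This shortcut is easy to check numerically. Here is a minimal sketch using simulated data (the variable names and coefficient values are illustrative, not the birthweight data): in a parallel-lines model, the difference in fitted means at any fixed age recovers the sex coefficient.

```r
# simulate from a parallel-lines model with known coefficients
# (illustrative values; not the birthweight data)
set.seed(1)
n <- 200
age <- runif(n, 30, 42)
male <- rbinom(n, 1, 0.5)
weight <- -1800 + 160 * male + 120 * age + rnorm(n, sd = 180)
fit <- lm(weight ~ male + age)

# difference in fitted means at a fixed age equals the `male` coefficient
diff_at_36 <-
  predict(fit, newdata = data.frame(male = 1, age = 36)) -
  predict(fit, newdata = data.frame(male = 0, age = 36))
all.equal(unname(diff_at_36), unname(coef(fit)["male"]))
```

The same check works for the fitted birthweight model by substituting `bw_lm1` for the simulated fit.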

Coefficient Interpretation

\[ E[Y|M=m,A=a] = \mu(m,a) = \beta_0 + \beta_M m + \beta_A a \]

Slope (of the mean with respect to age) for males: \[ \begin{aligned} \frac{d}{da}\mu(1,a) &= \frac{d}{da}(\beta_0 + \beta_M 1 + \beta_A a)\\ &= \left(\frac{d}{da}\beta_0 + \frac{d}{da}\beta_M 1 + \frac{d}{da}\beta_A a\right)\\ &= (0 + 0 + \beta_A) \\ &= \beta_A \end{aligned} \]

Slope for females:

\[ \begin{aligned} \frac{d}{da}\mu(0,a) &= \frac{d}{da}(\beta_0 + \beta_M 0 + \beta_A a)\\ &= \left(\frac{d}{da}\beta_0 + \frac{d}{da}\beta_M 0 + \frac{d}{da}\beta_A a\right)\\ &= (0 + 0 + \beta_A) \\ &= \beta_A \end{aligned} \]

Exercise 4 What is the interpretation of \(\beta_A\) in Model 1?

Solution. \[ \begin{aligned} \frac{d}{da}\mu(m,a) &= \frac{d}{da}(\beta_0 + \beta_M m + {\color{red}\beta_A} a)\\ &= \left(\frac{d}{da}\beta_0 + \frac{d}{da}\beta_M m + \frac{d}{da}{\color{red}\beta_A} a\right)\\ &= (0 + 0 + {\color{red}\beta_A}) \\ &= {\color{red}\beta_A} \end{aligned} \]

Conclusion:

\[\beta_A = \frac{d}{da}\mu(m,a)\]

\[\beta_A = E[Y|M = m, A = a+1] - E[Y|M = m, A = a]\]

Exercise 5 What is the interpretation of \(\beta_M\) in Model 1?

Solution.

\[ \begin{aligned} E[Y|M = 1,A = a] &= \beta_0 + {\color{red}\beta_M} 1 + \beta_A a \\ & = \beta_0 + {\color{red}\beta_M} + \beta_A a \\ E[Y|M = 0,A = a] &= \beta_0 + {\color{red}\beta_M} 0 + \beta_A a \\ & = \beta_0 + \beta_A a \end{aligned} \] So: \[ \begin{aligned} E[Y|M = 1,A = a] - E[Y|M = 0,A = a] &= (\beta_0 + {\color{red}\beta_M} + \beta_A a) - (\beta_0 + \beta_A a) \\ &= {\color{red}\beta_M} \end{aligned} \] Therefore: \[ \begin{aligned} \beta_M &= E[Y|M = 1,A = a] - E[Y|M = 0,A = a]\\ & =\mu(1,a) - \mu(0,a)\\ \end{aligned} \]

In words: \(\beta_M\) is the difference in mean birthweight between males and females adjusting for age.

Exercise 6 \(\beta_0 = ?\)

Solution. \[ \begin{aligned} \text{E}{\left[Y|M = 0,A = 0\right]} &= \mu(0,0)\\ &= {\color{red}\beta_0} + \beta_M 0 + \beta_A 0\\ &= {\color{red}\beta_0}\\ {\color{red}\beta_0} &= \text{E}{\left[Y|M = 0,A = 0\right]} = \mu(0,0) \end{aligned} \]

Interactions

\[E[Y|M=m,A=a] = \beta_0 + \beta_A a + \beta_M m + \beta_{AM} (a \cdot m) \tag{2}\]

[R code]
bw_lm2 <- lm(weight ~ sex + age + sex:age, data = bw)
bw_lm2 |>
  parameters() |>
  parameters::print_md(
    include_reference = include_reference_lines,
    select = "{estimate}"
  )
Table 4: Birthweight model with interaction term
Parameter Coefficient
(Intercept) -2141.67
sex (male) 872.99
age 130.40
sex (male) × age -18.42
[R code]
bw <-
  bw |>
  mutate(
    predlm2 = predict(bw_lm2)
  ) |>
  arrange(sex, age)

plot1_interact <-
  plot1 %+% bw +
  geom_line(aes(y = predlm2))

print(plot1_interact)
Figure 3: Birthweight model with interaction term

Here’s another way we could rewrite this model (by collecting terms involving \(S\)):

\[ E[Y|M,A] = \beta_0 + \beta_M M+ (\beta_A + \beta_{AM} M) A \]

Exercise 7 According to this model, what’s the mean birthweight for a female born at 36 weeks?

Parameter Coefficient
(Intercept) -2141.67
sex (male) 872.99
age 130.40
sex (male) × age -18.42

Solution.  

Parameter Coefficient
(Intercept) -2141.67
sex (male) 872.99
age 130.40
sex (male) × age -18.42
[R code]
pred_female <- coef(bw_lm2)["(Intercept)"] + coef(bw_lm2)["age"] * 36

\[ E[Y|M = 0, A = 36] = \beta_0 + \beta_M \cdot 0 + \beta_A \cdot 36 + \beta_{AM} \cdot (0 \cdot 36) = 2552.733333 \]

Exercise 8 What’s the mean birthweight for a male born at 36 weeks?

Parameter Coefficient
(Intercept) -2141.67
sex (male) 872.99
age 130.40
sex (male) × age -18.42

Solution.  

Parameter Coefficient
(Intercept) -2141.67
sex (male) 872.99
age 130.40
sex (male) × age -18.42
[R code]
pred_male <-
  coef(bw_lm2)["(Intercept)"] +
  coef(bw_lm2)["sexmale"] +
  coef(bw_lm2)["age"] * 36 +
  coef(bw_lm2)["sexmale:age"] * 36

\[ \begin{aligned} E[Y|M = 1, A = 36] &= \beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36 + \beta_{AM} \cdot 1 \cdot 36\\ &= 2762.706897 \end{aligned} \]

Exercise 9 What’s the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?

Solution. \[ \begin{aligned} & E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\ &= (\beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36 + \beta_{AM} \cdot 1 \cdot 36)\\ &\ \ \ \ \ -(\beta_0 + \beta_M \cdot 0+ \beta_A \cdot 36 + \beta_{AM} \cdot 0 \cdot 36) \\ &= \beta_{M} + \beta_{AM}\cdot 36\\ &= 209.973563 \end{aligned} \]

Coefficient Interpretation

Exercise 10 What is the interpretation of \(\beta_{M}\) in Model 2?

Solution.  

Mean birthweight among males with gestational age 0 weeks: \[ \begin{aligned} \mu(1,0) &= \text{E}{\left[Y|M = 1,A = 0\right]}\\ &= \beta_0 + {\color{red}\beta_M} \cdot 1 + \beta_A \cdot 0 + \beta_{AM}\cdot 1 \cdot 0\\ &= \beta_0 + {\color{red}\beta_M} \end{aligned} \] Mean birthweight among females with gestational age 0 weeks: \[ \begin{aligned} \mu(0,0) &= \text{E}{\left[Y|M = 0,A = 0\right]}\\ &= \beta_0 + {\color{red}\beta_M} \cdot 0 + \beta_A \cdot 0 + \beta_{AM}\cdot 0 \cdot 0\\ &= \beta_0 \end{aligned} \]

\[ \begin{aligned} \beta_{M} &= \mu(1,0) - \mu(0,0) \\ &= \text{E}{\left[Y|M = 1,A = 0\right]} - \text{E}{\left[Y|M = 0,A = 0\right]} \end{aligned} \] \(\beta_M\) is the difference in mean birthweight between males with gestational age 0 weeks and females with gestational age 0 weeks.

Exercise 11 What is the interpretation of \(\beta_{AM}\) in Model 2?

Solution.  

Slope among males: \[ \begin{aligned} \frac{\partial}{\partial a}\mu(1,a) &= \frac{\partial}{\partial a}\left(\beta_0 + \beta_M\cdot1 + \beta_A\cdot a + {\color{red}\beta_{AM}}\cdot 1 \cdot a\right)\\ &= \frac{\partial}{\partial a}\left(\beta_0 + \beta_M + \beta_A\cdot a + {\color{red}\beta_{AM}} \cdot a\right)\\ &= \beta_A + {\color{red}\beta_{AM}} \end{aligned} \] or \[ \begin{aligned} E[Y|1,a+1] - E[Y|1,a] = &\beta_0 + \beta_M 1 + \beta_A(a+1) + {\color{red}\beta_{AM}}1(a+1)\\ &- (\beta_0 + \beta_M 1 + \beta_A(a) + {\color{red}\beta_{AM}}1(a))\\ = &\beta_A + {\color{red}\beta_{AM}} \end{aligned} \]

Slope among females: \[ \begin{aligned} \frac{\partial}{\partial a}\mu(0,a) &= \frac{\partial}{\partial a}\left(\beta_0 + \beta_M\cdot0 + \beta_A\cdot a + {\color{red}\beta_{AM}}\cdot 0 \cdot a\right)\\ &= \frac{\partial}{\partial a}\left(\beta_0 + \beta_A\cdot a\right)\\ &= \beta_A \\ \end{aligned} \] or \[ \begin{aligned} E[Y|0,a+1] - E[Y|0,a] = &\beta_0 + \beta_M0 + \beta_A(a+1) + {\color{red}\beta_{AM}}0(a+1) \\ &- (\beta_0 + \beta_M 0 + \beta_A(a) + {\color{red}\beta_{AM}}0(a))\\ = &\beta_0 + \beta_A(a+1) - (\beta_0 + \beta_A(a))\\ = &\beta_A \end{aligned} \]

Difference in slopes: \[ \begin{aligned} \frac{\partial}{\partial a}\mu(1,a) - \frac{\partial}{\partial a}\mu(0,a) &= \beta_A + {\color{red}\beta_{AM}} - \beta_A\\ &= {\color{red}\beta_{AM}} \end{aligned} \] or \[ \begin{aligned} (E[Y|1,a+1] - E[Y|1,a]) - (E[Y|0,a+1] - E[Y|0,a]) &= \beta_A + {\color{red}\beta_{AM}} - \beta_A\\ &= {\color{red}\beta_{AM}} \end{aligned} \]

Therefore \[ \begin{aligned} \beta_{AM} = &\frac{\partial}{\partial a}\mu(1,a) - \frac{\partial}{\partial a}\mu(0,a)\\ = &(E[Y|M = 1,A = a+1] - E[Y|M = 1,A = a])\\ &- (E[Y|M = 0,A = a+1] - E[Y|M = 0,A = a]) \end{aligned} \]

\(\beta_{AM}\) is the difference in slope of mean birthweight with respect to gestational age between males and females.
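The slope derivations above can also be checked numerically. A minimal sketch with simulated data (names and coefficient values are illustrative): the female slope is the `age` coefficient alone, and the male slope adds the interaction coefficient.

```r
# simulate from an interaction model with known coefficients
# (illustrative values; not the birthweight data)
set.seed(1)
n <- 200
age <- runif(n, 30, 42)
male <- rbinom(n, 1, 0.5)
weight <- -2100 + 870 * male + 130 * age - 18 * male * age + rnorm(n, sd = 180)
fit <- lm(weight ~ male * age)

b <- coef(fit)
slope_female <- b["age"]                # beta_A
slope_male <- b["age"] + b["male:age"]  # beta_A + beta_AM

# check against a one-week difference in fitted means (exact for a linear model)
fd_male <-
  predict(fit, newdata = data.frame(male = 1, age = 37)) -
  predict(fit, newdata = data.frame(male = 1, age = 36))
all.equal(unname(fd_male), unname(slope_male))
```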

Compare coefficient interpretations

Table 5: Coefficient interpretations, by model structure
\(\mu(m,a)\) \(\beta_0 + \beta_M m + \beta_A a\) \(\beta_0 + \beta_M m + \beta_A a + {\color{red}\beta_{AM} ma}\)
\(\beta_0\) \(\mu(0,0)\) \(\mu(0,0)\)
\(\beta_A\) \(\frac{\partial}{\partial a}\mu({\color{blue}m},a)\) \(\frac{\partial}{\partial a}\mu({\color{red}0},a)\)
\(\beta_M\) \(\mu(1, {\color{blue}a}) - \mu(0, {\color{blue}a})\) \(\mu(1, {\color{red}0}) - \mu(0,{\color{red}0})\)
\(\beta_{AM}\) \(\frac{\partial}{\partial a}\mu(1,a) - \frac{\partial}{\partial a}\mu(0,a)\)

Stratified regression

\[ \text{E}{\left[Y|A=a, S=s\right]} = \beta_M m + \beta_{AM} (a \cdot m) + \beta_F f + \beta_{AF} (a \cdot f) \tag{3}\]

Compare this stratified model (Equation 3) with our interaction model, Equation 2:

\[ \text{E}{\left[Y|A=a, S=s\right]} = \beta_0 + \beta_A a + \beta_M m + \beta_{AM} (a \cdot m) \]

[R code]
bw_lm2 <- lm(weight ~ sex + age + sex:age, data = bw)
bw_lm2 |>
  parameters() |>
  print_md(
    include_reference = include_reference_lines,
    select = "{estimate}"
  )
Table 6: Birthweight model with interaction term
Parameter Coefficient
(Intercept) -2141.67
sex (male) 872.99
age 130.40
sex (male) × age -18.42
[R code]
bw_lm_strat <-
  bw |>
  lm(
    formula = weight ~ sex + sex:age - 1,
    data = _
  )

bw_lm_strat |>
  parameters() |>
  print_md(
    select = "{estimate}"
  )
Table 7: Birthweight model - stratified betas
Parameter Coefficient
sex (female) -2141.67
sex (male) -1268.67
sex (female) × age 130.40
sex (male) × age 111.98
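The stratified and interaction forms are the same model with relabeled coefficients, so their fitted values agree exactly. A sketch using the built-in mtcars data as a stand-in (grouping by transmission type rather than sex):

```r
# mtcars as a stand-in: transmission type plays the role of sex
d <- transform(mtcars, grp = factor(am, labels = c("auto", "manual")))

fit_interact <- lm(mpg ~ grp * wt, data = d)         # intercept + interaction form
fit_strat    <- lm(mpg ~ grp + grp:wt - 1, data = d) # stratified form

# same model, different labels: identical fitted values
all.equal(fitted(fit_interact), fitted(fit_strat))

# stratified slope for the second group = common slope + interaction
all.equal(
  unname(coef(fit_strat)["grpmanual:wt"]),
  unname(coef(fit_interact)["wt"] + coef(fit_interact)["grpmanual:wt"])
)
```

This mirrors the birthweight tables above: the stratified male slope (111.98) equals the interaction model's age slope plus the interaction term (130.40 - 18.42).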

Curved-line regression

[R code]
bw_lm3 <- lm(weight ~ sex:log(age) - 1, data = bw)

ggbw <-
  bw |>
  ggplot(
    aes(x = age, y = weight)
  ) +
  geom_point() +
  xlab("Gestational Age (weeks)") +
  ylab("Birth Weight (g)")

ggbw2 <- ggbw +
  stat_smooth(
    method = "lm",
    formula = y ~ log(x),
    geom = "smooth"
  ) +
  xlab("Gestational Age (weeks)") +
  ylab("Birth Weight (g)")


ggbw2 |> print()
Figure 4: birthweight model with age entering on log scale
[R code]
library(palmerpenguins)

ggpenguins <-
  palmerpenguins::penguins |>
  dplyr::filter(species == "Adelie") |>
  ggplot(
    aes(x = bill_length_mm, y = body_mass_g)
  ) +
  geom_point() +
  xlab("Bill length (mm)") +
  ylab("Body mass (g)")

ggpenguins2 <- ggpenguins +
  stat_smooth(
    method = "lm",
    formula = y ~ log(x),
    geom = "smooth"
  ) +
  xlab("Bill length (mm)") +
  ylab("Body mass (g)")


ggpenguins2 |> print()
Figure 5: palmerpenguins model with bill_length entering on log scale

1.3 Estimating Linear Models via Maximum Likelihood

Likelihood

\[ \begin{aligned} \mathscr{L}_i &\stackrel{\text{def}}{=}p(Y_i=y_i|\tilde{X}_i = \tilde{x}_i) \\ &= (2\pi\sigma^2)^{-1/2} \text{exp}{\left\{-\frac{1}{2\sigma^2}\varepsilon_i^2\right\}} \\ \varepsilon_i &\stackrel{\text{def}}{=}y_i - \mu_i \\ \mu_i &\stackrel{\text{def}}{=}\mu(x_i) \\ &= x_i \cdot \beta \end{aligned} \]

\[ \begin{aligned} \mathscr{L}&\stackrel{\text{def}}{=}\mathscr{L}(\tilde{y}|\mathbf{x}, \tilde{\beta}, \sigma^2) \\ &\stackrel{\text{def}}{=}\text{p}(\tilde{Y}=\tilde{y}| \mathbf{X}= \mathbf{x}) \\ &= \prod_{i=1}^n \mathscr{L}_i \end{aligned} \tag{4}\]

Log-likelihood

\[ \begin{aligned} \ell_i &\stackrel{\text{def}}{=}\text{log}{\left\{\mathscr{L}_i\right\}} \\ &= \text{log}{\left\{(2\pi\sigma^2)^{-1/2} \text{exp}{\left\{-\frac{1}{2\sigma^2}\varepsilon_i^2\right\}}\right\}} \\ &= -\frac{1}{2}\text{log}{\left\{2\pi\sigma^2\right\}} -\frac{1}{2\sigma^2}\varepsilon_i^2 \end{aligned} \]

\[ \begin{aligned} \ell &\stackrel{\text{def}}{=}\ell(\tilde{y}|\mathbf{x},\beta, \sigma^2) \\ &\stackrel{\text{def}}{=}\text{log}{\left\{\mathscr{L}(\tilde{y}|\mathbf{x},\beta, \sigma^2)\right\}} \\ &= \text{log}{\left\{\prod_{i=1}^n\mathscr{L}_i\right\}} \\ &= \sum_{i=1}^n\text{log}{\left\{\mathscr{L}_i\right\}} \\ &= \sum_{i=1}^n\ell_i \\ &= \sum_{i=1}^n{\left(-\frac{1}{2}\text{log}{\left\{2\pi\sigma^2\right\}} -\frac{1}{2\sigma^2}\varepsilon_i^2\right)} \\ &= -\frac{n}{2}\text{log}{\left\{2\pi\sigma^2\right\}} - \frac{1}{2\sigma^2}\sum_{i=1}^n \varepsilon_i^2 \\ &= -\frac{n}{2}\text{log}{\left\{2\pi\sigma^2\right\}} - \frac{1}{2\sigma^2}{\left(\tilde{\varepsilon}\cdot \tilde{\varepsilon}\right)} \\ &= -\frac{n}{2}\text{log}{\left\{2\pi\sigma^2\right\}} - \frac{1}{2\sigma^2}{\left((\tilde{y}- \tilde{\mu}) \cdot (\tilde{y}- \tilde{\mu})\right)} \\ &= -\frac{n}{2}\text{log}{\left\{2\pi\sigma^2\right\}} - \frac{1}{2\sigma^2}{\left((\tilde{y}- \mathbf{X}\tilde{\beta}) \cdot (\tilde{y}- \mathbf{X}\tilde{\beta})\right)} \\ &= -\frac{n}{2}\text{log}{\left\{2\pi\sigma^2\right\}} - \frac{1}{2\sigma^2}\sum_{i=1}^n {\left(y_i - {\left(\tilde{x}_i\cdot \tilde{\beta}\right)}\right)}^2 \end{aligned} \tag{5}\]

Score function

\[ \begin{aligned} \mu_i' &\stackrel{\text{def}}{=}\frac{\partial}{\partial \tilde{\beta}} \mu_i \\ &= \frac{\partial}{\partial \tilde{\beta}} {\left(\tilde{x}_i \cdot \tilde{\beta}\right)} \\ &= {\left( \frac{\partial}{\partial \tilde{\beta}} \tilde{\beta}\right)} \tilde{x}_i \\ &= \mathbb{I}\tilde{x}_i \\ &= \tilde{x}_i \end{aligned} \]

\[ \begin{aligned} \varepsilon'_i &\stackrel{\text{def}}{=}\frac{\partial}{\partial \tilde{\beta}}\varepsilon_i \\ &= \frac{\partial}{\partial \tilde{\beta}} (y_i - \mu_i) \\ &= \frac{\partial}{\partial \tilde{\beta}}y_i - \frac{\partial}{\partial \tilde{\beta}} \mu_i \\ &= 0 - \tilde{x}_i \\ &= - \tilde{x}_i \end{aligned} \]

\[ \begin{aligned} \ell'_i &\stackrel{\text{def}}{=}\frac{\partial}{\partial \tilde{\beta}}\ell_i \\ &= \frac{\partial}{\partial \tilde{\beta}} {\left(-\frac{1}{2}\text{log}{\left\{2\pi\sigma^2\right\}} -\frac{1}{2\sigma^2}\varepsilon_i^2\right)} \\ &= \frac{\partial}{\partial \tilde{\beta}}{\left(-\frac{1}{2}\text{log}{\left\{2\pi\sigma^2\right\}}\right)} - \frac{\partial}{\partial \tilde{\beta}}\frac{1}{2\sigma^2}\varepsilon_i^2 \\ &= 0 - \frac{1}{2\sigma^2}\frac{\partial}{\partial \tilde{\beta}}\varepsilon_i^2 \\ &= - \frac{1}{2\sigma^2}2 {\left(\varepsilon_i'\right)} \varepsilon_i \\ &= - \frac{1}{\sigma^2} {\left(- \tilde{x}_i \varepsilon_i\right)} \\ &= \frac{1}{\sigma^2} \tilde{x}_i \varepsilon_i \end{aligned} \]

\[ \begin{aligned} \ell'_{\tilde{\beta}} &\stackrel{\text{def}}{=}\frac{\partial}{\partial \tilde{\beta}}\ell_{\tilde{\beta}} \\ &= \frac{\partial}{\partial \tilde{\beta}} \sum_{i=1}^n\ell_i \\ &= \sum_{i=1}^n\frac{\partial}{\partial \tilde{\beta}} \ell_i \\ &= \sum_{i=1}^n\ell_i' \\ &= \sum_{i=1}^n\frac{1}{\sigma^2} \tilde{x}_i \varepsilon_i \\ &= \frac{1}{\sigma^2} \sum_{i=1}^n\tilde{x}_i \varepsilon_i \\ &= \frac{1}{\sigma^2} \mathbf{X}^{\top} \tilde{\varepsilon} \end{aligned} \]

Hessian

\[ \begin{aligned} \ell_i'' &\stackrel{\text{def}}{=}\frac{\partial}{\partial \tilde{\beta}^{\top}}\frac{\partial}{\partial \tilde{\beta}} \ell_i \\ &= \frac{\partial}{\partial \tilde{\beta}^{\top}} \ell_i' \\ &= \frac{\partial}{\partial \tilde{\beta}^{\top}} {\left(\frac{1}{\sigma^2} \tilde{x}_i \varepsilon_i\right)} \\ &= \frac{1}{\sigma^2} \tilde{x}_i \varepsilon_i'^{\top} \\ &= \frac{1}{\sigma^2} \tilde{x}_i (-\tilde{x}_i^{\top}) \\ &= -\frac{1}{\sigma^2} \tilde{x}_i \tilde{x}_i^{\top} \end{aligned} \]

\[ \begin{aligned} \ell'' &\stackrel{\text{def}}{=}\frac{\partial}{\partial \tilde{\beta}^{\top}}\frac{\partial}{\partial \tilde{\beta}} \ell \\ &= \frac{\partial}{\partial \tilde{\beta}^{\top}} \ell' \\ &= \frac{\partial}{\partial \tilde{\beta}^{\top}} \sum_{i=1}^n\ell_i' \\ &= \sum_{i=1}^n\frac{\partial}{\partial \tilde{\beta}^{\top}} \ell_i' \\ &= \sum_{i=1}^n\ell_i'' \\ &= \sum_{i=1}^n-\frac{1}{\sigma^2} \tilde{x}_i \tilde{x}_i^{\top} \\ &= -\frac{1}{\sigma^2} \sum_{i=1}^n\tilde{x}_i \tilde{x}_i^{\top} \\ &= -\frac{1}{\sigma^2} \mathbf{X}^{\top}\mathbf{X} \end{aligned} \] That is,

\[\ell''= -\frac{1}{\sigma^2} \sum_{i=1}^n\tilde{x}_i \tilde{x}_i^{\top} \tag{6}\]

Alternative approach using matrix derivatives

\[ \begin{aligned} \ell'_{\tilde{\beta}}(\tilde{y}|\mathbf{x}, \tilde{\beta}, \sigma^2) &\stackrel{\text{def}}{=}\frac{\partial}{\partial \tilde{\beta}}\ell_{\tilde{\beta}}(\tilde{y}|\mathbf{x}, \tilde{\beta}, \sigma^2) \\ &= - \frac{1}{2\sigma^2}\frac{\partial}{\partial \tilde{\beta}} {\left(\sum_{i=1}^n {\left(y_i - {\left(\tilde{x}_i\cdot \tilde{\beta}\right)}\right)}^2\right)} \end{aligned} \tag{7}\]

\[ \sum_{i=1}^n (y_i - \tilde{x}_i^{\top} \tilde{\beta})^2 = (\tilde{y}- \mathbf{X}\tilde{\beta}) \cdot (\tilde{y}- \mathbf{X}\tilde{\beta}) \]

So

\[ \begin{aligned} (\tilde{y}- \mathbf{X}\tilde{\beta})'(\tilde{y}- \mathbf{X}\tilde{\beta}) &= (\tilde{y}' - \tilde{\beta}'\mathbf{X}')(\tilde{y}- \mathbf{X}\tilde{\beta}) \\ &= \tilde{y}'\tilde{y}- \tilde{\beta}'\mathbf{X}'\tilde{y}- \tilde{y}'\mathbf{X}\tilde{\beta}+\tilde{\beta}'\mathbf{X}'\mathbf{X}\tilde{\beta}\\ &= \tilde{y}'\tilde{y}- 2\tilde{y}'\mathbf{X}\tilde{\beta}+\tilde{\beta}'\mathbf{X}'\mathbf{X}\tilde{\beta}\end{aligned} \]

\[ \begin{aligned} \frac{\partial}{\partial \tilde{\beta}}{\left(\sum_{i=1}^n (y_i - \tilde{x}_i' \tilde{\beta})^2\right)} &= \frac{\partial}{\partial \tilde{\beta}}(\tilde{y}- \mathbf{X}\tilde{\beta})'(\tilde{y}- \mathbf{X}\tilde{\beta}) \\ &= \frac{\partial}{\partial \tilde{\beta}} (\tilde{y}'\tilde{y}- 2\tilde{y}'\mathbf{X}\tilde{\beta}+\tilde{\beta}'\mathbf{X}'\mathbf{X}\tilde{\beta}) \\ &= - 2\mathbf{X}'\tilde{y}+2\mathbf{X}'\mathbf{X}\tilde{\beta}\\ &= - 2\mathbf{X}'(\tilde{y}- \mathbf{X}\tilde{\beta}) \\ &= - 2\mathbf{X}'(\tilde{y}- \text{E}[\tilde{y}]) \\ &= - 2\mathbf{X}'\tilde{\varepsilon}\end{aligned} \tag{8}\]

So if \(\ell'(\tilde{\beta},\sigma^2) = 0\), then

\[ \begin{aligned} 0 &= - 2\mathbf{X}'\tilde{y}+2\mathbf{X}'\mathbf{X}\tilde{\beta}\\ 2\mathbf{X}'\tilde{y}&= 2\mathbf{X}'\mathbf{X}\tilde{\beta}\\ \mathbf{X}'\tilde{y}&= \mathbf{X}'\mathbf{X}\tilde{\beta}\\ \tilde{\beta}&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{y}\end{aligned} \]
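We can verify the closed-form solution \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{y}\) against lm(). A sketch using the built-in mtcars data as a stand-in design matrix:

```r
# design matrix and response from mtcars (stand-in for the general case)
X <- cbind(`(Intercept)` = 1, wt = mtcars$wt, hp = mtcars$hp)
y <- mtcars$mpg

# solve the normal equations  X'X beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(drop(beta_hat), coef(fit))
```

(In practice lm() uses a QR decomposition rather than inverting \(\mathbf{X}'\mathbf{X}\), which is more numerically stable; the estimates agree.)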

Hessian

The Hessian (second derivative matrix) is:

\[ \begin{aligned} \ell_{\beta, \beta'} ''(\beta, \sigma^2;\tilde{y}, \mathbf{X}) &= -\frac{1}{\sigma^2}\mathbf{X}'\mathbf{X} \end{aligned} \]

\(\ell_{\beta, \beta'} ''(\beta, \sigma^2;\mathbf X,\tilde{y})\) is negative definite, so \(\hat \beta_{ML} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{y}\) is the MLE for \(\tilde{\beta}\).

Similarly (not shown):

\[ \hat\sigma^2_{ML} = \frac{1}{n} (Y-X\hat\beta)'(Y-X\hat\beta) \]
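Note the denominator: the MLE divides the residual sum of squares by \(n\), whereas the “residual standard error” that R's sigma() reports divides by \(n-p\). A quick sketch with mtcars as a stand-in model:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model
n <- nobs(fit)
p <- length(coef(fit))
rss <- sum(resid(fit)^2)

sigma2_mle <- rss / n        # MLE: divides by n
sigma2_rse <- rss / (n - p)  # "residual standard error", squared: divides by n - p

all.equal(sigma2_rse, sigma(fit)^2)
```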

And

\[ \begin{aligned} \mathcal I_{\beta} &= E[-\ell_{\beta, \beta'} ''(Y|X,\beta, \sigma^2)]\\ &= \frac{1}{\sigma^2}\mathbf{X}'\mathbf{X} \end{aligned} \]

So:

\[ Var(\hat \beta) \approx (\mathcal I_{\beta})^{-1} = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1} \]

and

\[ \hat\beta \dot \sim N(\beta, \mathcal I_{\beta}^{-1}) \]
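We can check the variance formula against R's vcov(), which plugs the unbiased estimate \(\hat\sigma^2 = \text{RSS}/(n-p)\) in for \(\sigma^2\). A sketch with mtcars as a stand-in model:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model
X <- model.matrix(fit)

# sigma_hat^2 (X'X)^{-1}, with the unbiased variance estimate plugged in
V <- sigma(fit)^2 * solve(t(X) %*% X)

all.equal(V, vcov(fit))
```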

In the Gaussian linear regression case, we also have exact results:

\[ \frac{\hat\beta_j}{\widehat{\text{se}}{\left(\hat\beta_j\right)}} \ \sim \ t_{n-p} \]

Example 1 (MLEs for birthweight data) In model 2 above, \(\hat{\mathcal{I}}(\beta)\) is:

[R code]
bw_lm2 |> vcov()
#>             (Intercept)  sexmale        age sexmale:age
#> (Intercept)     1353968 -1353968 -34870.966   34870.966
#> sexmale        -1353968  2596387  34870.966  -67210.974
#> age              -34871    34871    899.896    -899.896
#> sexmale:age       34871   -67211   -899.896    1743.548
Table 8: Covariance matrix of \(\hat{\tilde{\beta}}\) for birthweight model 2 (with interaction term)
[R code]
bw_lm2 |>
  vcov() |>
  diag() |>
  sqrt()
#> (Intercept)     sexmale         age sexmale:age 
#>   1163.6015   1611.3309     29.9983     41.7558
[R code]
bw_lm2 |>
  parameters() |>
  print_md()
Table 9: Estimated model for birthweight data with interaction term
Parameter Coefficient SE 95% CI t(20) p
(Intercept) -2141.67 1163.60 (-4568.90, 285.56) -1.84 0.081
sex (male) 872.99 1611.33 (-2488.18, 4234.17) 0.54 0.594
age 130.40 30.00 (67.82, 192.98) 4.35 < .001
sex (male) × age -18.42 41.76 (-105.52, 68.68) -0.44 0.664

Residual Standard Deviation

[R code]
sigma(bw_lm2)
#> [1] 180.613

Note: despite its label, the “Residual standard error” reported by summary() is not a standard error — it is an estimate of \(\sigma\) (with denominator \(n-p\) rather than the MLE’s \(n\)).

[R code]
summary(bw_lm2)
#> 
#> Call:
#> lm(formula = weight ~ sex + age + sex:age, data = bw)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -246.7 -138.1  -39.1  176.6  274.3 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -2141.7     1163.6   -1.84  0.08057 .  
#> sexmale        873.0     1611.3    0.54  0.59395    
#> age            130.4       30.0    4.35  0.00031 ***
#> sexmale:age    -18.4       41.8   -0.44  0.66389    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 181 on 20 degrees of freedom
#> Multiple R-squared:  0.643,  Adjusted R-squared:  0.59 
#> F-statistic:   12 on 3 and 20 DF,  p-value: 0.000101

1.4 Inference about Gaussian Linear Regression Models

Motivating example: birthweight data

Research question: is there really an interaction between sex and age?

\(H_0: \beta_{AM} = 0\)

\(H_A: \beta_{AM} \neq 0\)

\(P(|\hat\beta_{AM}| > |-18.417241| \mid H_0)\) = ?

Wald tests and CIs

R can give you Wald tests for single coefficients and corresponding CIs:

[R code]
bw_lm2 |>
  parameters() |>
  print_md(
    include_reference = TRUE
  )
Parameter Coefficient SE 95% CI t(20) p
(Intercept) -2141.67 1163.60 (-4568.90, 285.56) -1.84 0.081
sex (female) 0.00
sex (male) 872.99 1611.33 (-2488.18, 4234.17) 0.54 0.594
age 130.40 30.00 (67.82, 192.98) 4.35 < .001
sex (male) × age -18.42 41.76 (-105.52, 68.68) -0.44 0.664

To understand what’s happening, let’s replicate these results by hand for the interaction term.

P-values

[R code]
bw_lm2 |>
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE
  )
Parameter Coefficient SE 95% CI t(20) p
sex (male) × age -18.42 41.76 (-105.52, 68.68) -0.44 0.664
[R code]
beta_hat <- coef(summary(bw_lm2))["sexmale:age", "Estimate"]
se_hat <- coef(summary(bw_lm2))["sexmale:age", "Std. Error"]
dfresid <- bw_lm2$df.residual
t_stat <- abs(beta_hat) / se_hat
pval_t <-
  pt(-t_stat, df = dfresid, lower.tail = TRUE) +
  pt(t_stat, df = dfresid, lower.tail = FALSE)

\[ \begin{aligned} &P{\left( | \hat \beta_{AM} | > | -18.417241| \middle| H_0 \right)} \\ &= \Pr {\left( \left| \frac{\hat\beta_{AM}}{\hat{SE}(\hat\beta_{AM})} \right| > \left| \frac{-18.417241}{41.755817} \right| \middle| H_0 \right)}\\ &= \Pr {\left( \left| T_{20} \right| > 0.44107 | H_0 \right)}\\ &= 0.663893 \end{aligned} \]

Confidence intervals

[R code]
bw_lm2 |>
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE
  )
Parameter Coefficient SE 95% CI t(20) p
sex (male) × age -18.42 41.76 (-105.52, 68.68) -0.44 0.664
[R code]
q_t <- qt(
  p = 0.975,
  df = dfresid,
  lower.tail = TRUE
)

confint_radius_t <-
  se_hat * q_t

confint_t <- beta_hat + c(-1, 1) * confint_radius_t

print(confint_t)
#> [1] -105.5184   68.6839

Gaussian approximations

Here are the asymptotic (Gaussian approximation) equivalents:

P-values

[R code]
bw_lm2 |>
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE
  )
Parameter Coefficient SE 95% CI t(20) p
sex (male) × age -18.42 41.76 (-105.52, 68.68) -0.44 0.664
[R code]
pval_z <- pnorm(abs(t_stat), lower = FALSE) * 2

print(pval_z)
#> [1] 0.659162

Confidence intervals

[R code]
bw_lm2 |>
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE
  )
Parameter Coefficient SE 95% CI t(20) p
sex (male) × age -18.42 41.76 (-105.52, 68.68) -0.44 0.664
[R code]
confint_radius_z <- se_hat * qnorm(0.975, lower = TRUE)
confint_z <-
  beta_hat + c(-1, 1) * confint_radius_z
print(confint_z)
#> [1] -100.2571   63.4227

Likelihood ratio statistics

[R code]
logLik(bw_lm2)
#> 'log Lik.' -156.579 (df=5)
logLik(bw_lm1)
#> 'log Lik.' -156.695 (df=4)

log_LR <- (logLik(bw_lm2) - logLik(bw_lm1)) |> as.numeric()
delta_df <- (bw_lm1$df.residual - df.residual(bw_lm2))


x_max <- 1
[R code]
d_log_LR <- function(x, df = delta_df) dchisq(x, df = df)

chisq_plot <-
  ggplot() +
  geom_function(fun = d_log_LR) +
  stat_function(
    fun = d_log_LR,
    xlim = c(log_LR, x_max),
    geom = "area",
    fill = "gray"
  ) +
  geom_segment(
    aes(
      x = log_LR,
      xend = log_LR,
      y = 0,
      yend = d_log_LR(log_LR)
    ),
    col = "red"
  ) +
  xlim(0.0001, x_max) +
  ylim(0, 4) +
  ylab("p(X=x)") +
  xlab("log(likelihood ratio) statistic [x]") +
  theme_classic()
chisq_plot |> print()
Figure 6: Chi-square distribution

Now we can get the p-value:

[R code]
pchisq(
  q = 2 * log_LR,
  df = delta_df,
  lower = FALSE
) |>
  print()
#> [1] 0.629806

In practice you don’t have to do this by hand; there are functions to do it for you:

[R code]
# built in
library(lmtest)
lrtest(bw_lm2, bw_lm1)

1.5 Goodness of fit

AIC and BIC

  • AIC = \(-2\ell(\hat\theta) + 2p\)

  • BIC = \(-2\ell(\hat\theta) + p \cdot \text{log}(n)\)

where \(\ell\) is the log-likelihood of the data evaluated using the parameter estimates \(\hat\theta\), \(p\) is the number of estimated parameters in the model (including \(\hat\sigma^2\)), and \(n\) is the number of observations.

You can calculate these criteria using the logLik() function, or use the built-in R functions:

AIC in R

[R code]
-2 * logLik(bw_lm2) |> as.numeric() +
  2 * (length(coef(bw_lm2)) + 1) # sigma counts as a parameter here
#> [1] 323.159

AIC(bw_lm2)
#> [1] 323.159

BIC in R

[R code]
-2 * logLik(bw_lm2) |> as.numeric() +
  (length(coef(bw_lm2)) + 1) * log(nobs(bw_lm2))
#> [1] 329.049

BIC(bw_lm2)
#> [1] 329.049

Smaller values of AIC and BIC indicate a better trade-off between fit and model complexity; larger values are worse. There are no hypothesis tests or p-values associated with these criteria.
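
For example, we can compare our two birthweight models directly (a quick check, assuming the `bw_lm1` and `bw_lm2` fits from earlier); `AIC()` and `BIC()` accept multiple fitted models at once, and the model with the smaller value is preferred by that criterion:

[R code]
# compare the one-slope and two-slope models by AIC and BIC
AIC(bw_lm1, bw_lm2)
BIC(bw_lm1, bw_lm2)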

(Residual) Deviance

Let \(q\) be the number of distinct covariate combinations in a data set.

[R code]
bw_X_unique <-
  bw |>
  count(sex, age)

n_unique_bw <- nrow(bw_X_unique)

For example, in the birthweight data, there are \(q = 12\) unique patterns (Table 10).

[R code]
bw_X_unique
Table 10: Unique covariate combinations in the birthweight data, with replicate counts

Definition 1 (Replicates) If a given covariate pattern has more than one observation in a dataset, those observations are called replicates.

Example 2 (Replicates in the birthweight data) In the birthweight dataset, there are 2 replicates of the combination “female, age 36” (Table 10).

Exercise 12 (Replicates in the birthweight data) Which covariate pattern(s) in the birthweight data has the most replicates?

Solution 1 (Replicates in the birthweight data). Two covariate patterns are tied for most replicates: males at age 40 weeks and females at age 40 weeks. 40 weeks is the usual length of a human pregnancy (Polin, Fox, and Abman 2011), so this result makes sense.

[R code]
bw_X_unique |> dplyr::filter(n == max(n))

Saturated models

The most complicated model we could fit would have one parameter (a mean) for each covariate pattern, plus a variance parameter:

[R code]
lm_max <-
  bw |>
  mutate(age = factor(age)) |>
  lm(
    formula = weight ~ sex:age - 1,
    data = _
  )

lm_max |>
  parameters() |>
  print_md()
Table 11: Saturated model for the birthweight data
Parameter Coefficient SE 95% CI t(12) p
sex (male) × age35 2925.00 187.92 (2515.55, 3334.45) 15.56 < .001
sex (female) × age36 2570.50 132.88 (2280.98, 2860.02) 19.34 < .001
sex (male) × age36 2625.00 187.92 (2215.55, 3034.45) 13.97 < .001
sex (female) × age37 2539.00 187.92 (2129.55, 2948.45) 13.51 < .001
sex (male) × age37 2737.50 132.88 (2447.98, 3027.02) 20.60 < .001
sex (female) × age38 2872.50 132.88 (2582.98, 3162.02) 21.62 < .001
sex (male) × age38 2982.00 108.50 (2745.60, 3218.40) 27.48 < .001
sex (female) × age39 2846.00 132.88 (2556.48, 3135.52) 21.42 < .001
sex (female) × age40 3152.25 93.96 (2947.52, 3356.98) 33.55 < .001
sex (male) × age40 3256.25 93.96 (3051.52, 3460.98) 34.66 < .001
sex (male) × age41 3292.00 187.92 (2882.55, 3701.45) 17.52 < .001
sex (female) × age42 3210.00 187.92 (2800.55, 3619.45) 17.08 < .001

We call this model the full, maximal, or saturated model for this dataset.

[R code]
library(rlang) # defines the `.data` pronoun
plot_PIs_and_CIs <- function(model, data = bw) {
  cis <- model |>
    predict(interval = "confidence") |>
    suppressWarnings() |>
    tibble::as_tibble()
  names(cis) <- paste("ci", names(cis), sep = "_")

  preds <- model |>
    predict(interval = "predict") |>
    suppressWarnings() |>
    tibble::as_tibble()
  names(preds) <- paste("pred", names(preds), sep = "_")
  dplyr::bind_cols(data, cis, preds) |>
    ggplot2::ggplot() +
    ggplot2::aes(
      x = .data$age, 
      y = .data$weight, 
      col = .data$sex
    ) +
    ggplot2::geom_point() +
    ggplot2::theme(legend.position = "bottom") +
    ggplot2::geom_line(ggplot2::aes(y = .data$ci_fit)) +
    ggplot2::geom_ribbon(
      ggplot2::aes(
        ymin = .data$pred_lwr,
        ymax = .data$pred_upr
      ),
      alpha = 0.2
    ) +
    ggplot2::geom_ribbon(
      ggplot2::aes(
        ymin = .data$ci_lwr,
        ymax = .data$ci_upr
      ),
      alpha = 0.5
    ) +
    ggplot2::facet_wrap(~sex)
}
Figure 7: Model 2 and saturated model for birthweight data, with confidence and prediction intervals
[R code]
plot_PIs_and_CIs(bw_lm2, bw)
(a) Model 2 (linear with age:sex interaction)

[R code]
plot_PIs_and_CIs(lm_max, bw)
(b) Saturated model

We can calculate the log-likelihood of this model as usual:

[R code]
logLik(lm_max)
#> 'log Lik.' -151.402 (df=13)

We can compare this model to our other models using chi-square tests, as usual:

[R code]
lrtest(lm_max, bw_lm2)

The likelihood ratio statistic for this test is \[\lambda = 2 * (\ell_{\text{full}} - \ell) = 10.355374\] where:

  • \(\ell_{\text{full}}\) is the log-likelihood of the full model: -151.401601
  • \(\ell\) is the log-likelihood of our comparison model (two slopes, two intercepts): -156.579288

This statistic is called the deviance or residual deviance for our two-slopes and two-intercepts model; it tells us how much the likelihood of that model deviates from the likelihood of the maximal model.

The corresponding p-value tells us whether we have enough evidence to detect that our two-slopes, two-intercepts model is a worse fit for the data than the maximal model; in other words, it tells us whether there's evidence that we missed any important patterns. (Remember, a nonsignificant p-value could mean that we didn't miss anything and a more complicated model is unnecessary, or it could mean we just don't have enough data to tell the difference between these models.)
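
As a sanity check (a sketch, assuming the `lm_max` and `bw_lm2` fits from this section), we can compute the residual deviance and its p-value by hand:

[R code]
# residual deviance: twice the gap between the saturated and fitted log-likelihoods
resid_dev <- 2 * (as.numeric(logLik(lm_max)) - as.numeric(logLik(bw_lm2)))

# df: difference in residual degrees of freedom between the two models
delta_df_sat <- df.residual(bw_lm2) - df.residual(lm_max)

pchisq(resid_dev, df = delta_df_sat, lower = FALSE)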

Null Deviance

Similarly, the least complicated model we could fit would have only one mean parameter, an intercept:

\[\text E[Y|X=x] = \beta_0\]

We can fit this model in R like so:

[R code]
lm0 <- lm(weight ~ 1, data = bw)

lm0 |>
  parameters() |>
  print_md()
Parameter Coefficient SE 95% CI t(23) p
(Intercept) 2967.67 57.58 (2848.56, 3086.77) 51.54 < .001
[R code]
lm0 |> plot_PIs_and_CIs(bw)
Figure 8: Null model for birthweight data, with 95% confidence and prediction intervals.

This model also has a likelihood:

[R code]
logLik(lm0)
#> 'log Lik.' -168.955 (df=2)

And we can compare it to more complicated models using a likelihood ratio test:

[R code]
lrtest(bw_lm2, lm0)

The likelihood ratio statistic for the test comparing the null model to the maximal model is \[\lambda = 2 * (\ell_{\text{full}} - \ell_{0}) = 35.106732\] where:

  • \(\ell_{\text{0}}\) is the log-likelihood of the null model: -168.954967
  • \(\ell_{\text{full}}\) is the log-likelihood of the maximal model: -151.401601

In R, this test is:

[R code]
lrtest(lm_max, lm0)

This log-likelihood ratio statistic is called the null deviance. It tells us whether we have enough data to detect a difference between the null and full models.
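
We can compute the null deviance by hand in the same way (a sketch, assuming the `lm_max` and `lm0` fits from above):

[R code]
# null deviance: twice the gap between the saturated and null log-likelihoods
null_dev <- 2 * (as.numeric(logLik(lm_max)) - as.numeric(logLik(lm0)))
print(null_dev)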

1.6 Rescaling

Rescale age

[R code]
bw <-
  bw |>
  mutate(
    `age - mean` = age - mean(age),
    `age - 36wks` = age - 36
  )

lm1_c <- lm(weight ~ sex + `age - 36wks`, data = bw)

lm2_c <- lm(weight ~ sex + `age - 36wks` + sex:`age - 36wks`, data = bw)

parameters(lm2_c, ci_method = "wald") |> print_md()
Parameter Coefficient SE 95% CI t(20) p
(Intercept) 2552.73 97.59 (2349.16, 2756.30) 26.16 < .001
sex (male) 209.97 129.75 (-60.68, 480.63) 1.62 0.121
age - 36wks 130.40 30.00 (67.82, 192.98) 4.35 < .001
sex (male) × age - 36wks -18.42 41.76 (-105.52, 68.68) -0.44 0.664

Compare with what we got without rescaling:

[R code]
parameters(bw_lm2, ci_method = "wald") |> print_md()
Parameter Coefficient SE 95% CI t(20) p
(Intercept) -2141.67 1163.60 (-4568.90, 285.56) -1.84 0.081
sex (male) 872.99 1611.33 (-2488.18, 4234.17) 0.54 0.594
age 130.40 30.00 (67.82, 192.98) 4.35 < .001
sex (male) × age -18.42 41.76 (-105.52, 68.68) -0.44 0.664

1.7 Prediction

Prediction for linear models

Definition 2 (Predicted value) In a regression model \(\text{p}(y|\tilde{x})\), the predicted value of \(y\) given \(\tilde{x}\) is the estimated mean of \(Y\) given \(\tilde{X}=\tilde{x}\):

\[\hat y \stackrel{\text{def}}{=}\hat{\text{E}}{\left[Y|\tilde{X}=\tilde{x}\right]}\]

For linear models, the predicted value can be straightforwardly calculated by multiplying each predictor value \(x_j\) by its corresponding coefficient \(\beta_j\) and adding up the results:

\[ \begin{aligned} \hat y &= \hat{\text{E}}{\left[Y|\tilde{X}=\tilde{x}\right]} \\ &= \tilde{x}'\hat\beta \\ &= \hat\beta_0\cdot 1 + \hat\beta_1 x_1 + ... + \hat\beta_p x_p \end{aligned} \]

Example: prediction for the birthweight data

[R code]
x <- c(1, 1, 40)
sum(x * coef(bw_lm1))
#> [1] 3225.49

R has built-in functions for prediction:

[R code]
x <- tibble(age = 40, sex = "male")
bw_lm1 |> predict(newdata = x)
#>       1 
#> 3225.49

If you don’t provide newdata, R will use the covariate values from the original dataset:

[R code]
predict(bw_lm1)
#>       1       2       3       4       5       6       7       8       9      10 
#> 3225.49 3062.45 2983.70 2578.87 3225.49 3062.45 2621.02 2820.66 2741.91 3304.24 
#>      11      12      13      14      15      16      17      18      19      20 
#> 2862.81 2941.56 3346.38 3062.45 3225.49 2699.77 2862.81 2578.87 2983.70 2820.66 
#>      21      22      23      24 
#> 3225.49 2941.56 2983.70 3062.45

These special predictions are called the fitted values of the dataset:

Definition 3 For a given dataset \((\tilde{Y}, \mathbf{X})\) and corresponding fitted model \(\text{p}_{\hat \beta}(\tilde{y}|\mathbf{x})\), the fitted value of \(y_i\) is the predicted value of \(y\) when \(\tilde{X}=\tilde{x}_i\), using the estimated parameters \(\hat \beta\).

R has an extra function to get these values:

[R code]
fitted(bw_lm1)
#>       1       2       3       4       5       6       7       8       9      10 
#> 3225.49 3062.45 2983.70 2578.87 3225.49 3062.45 2621.02 2820.66 2741.91 3304.24 
#>      11      12      13      14      15      16      17      18      19      20 
#> 2862.81 2941.56 3346.38 3062.45 3225.49 2699.77 2862.81 2578.87 2983.70 2820.66 
#>      21      22      23      24 
#> 3225.49 2941.56 2983.70 3062.45

Confidence intervals

Use predict(se.fit = TRUE) to compute SEs for predicted values:

[R code]
bw_lm1 |>
  predict(
    newdata = x,
    se.fit = TRUE
  )
#> $fit
#>       1 
#> 3225.49 
#> 
#> $se.fit
#> [1] 61.4599
#> 
#> $df
#> [1] 21
#> 
#> $residual.scale
#> [1] 177.116

The output of predict.lm(se.fit = TRUE) is a list(); you can extract the elements with $ or magrittr::use_series():

[R code]
library(magrittr)
bw_lm1 |>
  predict(
    newdata = x,
    se.fit = TRUE
  ) |>
  use_series(se.fit)
#> [1] 61.4599

We can construct confidence intervals for \(\text{E}{\left[Y|X=x\right]}\) using the usual formula:

\[\mu(\tilde{x}) \in {\left({\color{red}\hat{\mu}(\tilde{x})} \pm {\color{blue}\zeta_\alpha}\right)}\]

\[ {\color{blue}\zeta_\alpha} = t_{n-p}{\left(1-\frac{\alpha}{2}\right)} * \widehat{\text{se}}{\left(\hat{\mu}(\tilde{x})\right)} \]

\[{\color{red}\hat{\mu}(\tilde{x})} = \tilde{x}\cdot \hat \beta\]

\[\text{se}{\left(\hat{\mu}(\tilde{x})\right)} = \sqrt{\text{Var}{\left(\hat{\mu}(\tilde{x})\right)}}\] \[ \begin{aligned} \text{Var}{\left(\hat{\mu}(\tilde{x})\right)} &= \text{Var}{\left(x'\hat{\beta}\right)} \\&= x' \text{Var}{\left(\hat{\beta}\right)} x \\&= x' \sigma^2(\mathbf{X}'\mathbf{X})^{-1} x \\&= \sigma^2 x' (\mathbf{X}'\mathbf{X})^{-1} x \\&= \sum_{i=1}^n\sum_{j=1}^nx_i x_j \text{Cov}{\left(\hat \beta_i, \hat \beta_j\right)} \end{aligned} \] \[\widehat{\text{Var}}{\left(\hat{\mu}(\tilde{x})\right)} = \hat \sigma^2 x' (\mathbf{X}'\mathbf{X})^{-1} x\]
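We can check this variance formula numerically (a sketch, assuming the `bw_lm1` fit and the coefficient ordering from earlier; note that `vcov()` returns \(\hat\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\)):

[R code]
x_vec <- c(1, 1, 40) # intercept, sex = male, age = 40

# x' * Var-hat(beta-hat) * x, then take the square root
se_mu_hat <- sqrt(t(x_vec) %*% vcov(bw_lm1) %*% x_vec) |> as.numeric()
se_mu_hat # should match the `se.fit` component from predict() above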

[R code]
bw_lm2 |> predict(
  newdata = x,
  interval = "confidence"
)
#>       fit     lwr     upr
#> 1 3210.64 3062.23 3359.05
[R code]
library(sjPlot)
bw_lm2 |>
  plot_model(type = "pred", terms = c("age", "sex"), show.data = TRUE) +
  theme_sjplot() +
  theme(legend.position = "bottom")
Figure 9: Predicted values and confidence bands for the birthweight model with interaction term

Prediction intervals

We can also construct prediction intervals for the value of a new observation \(Y^*\), given a covariate pattern \(\tilde{x}^*\):

\[ \begin{aligned} \hat{Y^*} &= \hat\mu({\tilde{x}}^*) + \hat{\epsilon}^* \\ \text{Var}{\left(\hat{Y}\right)} &= \text{Var}{\left(\hat\mu\right)} + \text{Var}{\left(\hat\epsilon\right)} \\ \text{Var}{\left(\hat{Y}\right)} &= \text{Var}{\left(x'\hat \beta\right)} + \text{Var}{\left(\hat\epsilon\right)} \\ &= {{\tilde{x}}^{*\top}} \text{Var}{\left(\hat \beta\right)} {{\tilde{x}}^*} + \sigma^2 \\ &= {{\tilde{x}}^{*\top}} {\left(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right)} {{\tilde{x}}^*} + \sigma^2 \\ &= \sigma^2 {{\tilde{x}}^{*\top}} (\mathbf{X}'\mathbf{X})^{-1} {{\tilde{x}}^*} + \sigma^2 \end{aligned} \tag{9}\]
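
As a check, we can compute this prediction interval directly from Equation 9 (a sketch, assuming `bw_lm2` and the covariate tibble `x` defined above):

[R code]
pred <- bw_lm2 |> predict(newdata = x, se.fit = TRUE)

# Var(Y*) = Var(mu-hat(x*)) + sigma^2
se_pred <- sqrt(pred$se.fit^2 + sigma(bw_lm2)^2)

# t-based interval, using the residual degrees of freedom
pred$fit + c(-1, 1) * qt(0.975, df = df.residual(bw_lm2)) * se_pred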

[R code]
bw_lm2 |>
  predict(newdata = x, interval = "predict")
#>       fit     lwr     upr
#> 1 3210.64 2805.71 3615.57

If you don’t specify newdata, you get a warning:

[R code]
bw_lm2 |>
  predict(interval = "predict") |>
  head()
#> Warning in predict.lm(bw_lm2, interval = "predict"): predictions on current data refer to _future_ responses
#>       fit     lwr     upr
#> 1 2552.73 2124.50 2980.97
#> 2 2552.73 2124.50 2980.97
#> 3 2683.13 2275.99 3090.27
#> 4 2813.53 2418.60 3208.47
#> 5 2813.53 2418.60 3208.47
#> 6 2943.93 2551.48 3336.38

The warning from the last command is: “predictions on current data refer to future responses” (since you already know what happened to the current data, and thus don’t need to predict it).

See ?predict.lm for more.

[R code]
plot_PIs_and_CIs(bw_lm2, bw)
Figure 10: Confidence and prediction intervals for birthweight model 2

1.8 Diagnostics

Tip

This section is adapted from Dobson and Barnett (2018, secs. 6.2–6.3) and Dunn and Smyth (2018) Chapter 3.

Assumptions in linear regression models

\[Y_i|\tilde{X}_i \sim_{{\color{red}\perp\!\!\!\perp}} {\color{blue}\text{N}}({\color{orange}\mu_i}, {\color{green}\sigma^2})\] \[{\color{orange}\mu_i = \tilde{x}_i \cdot \tilde{\beta}}\]

  1. \({\color{blue}\text{Normality}}\)

The model assumes that the distribution conditional on a given \(X\) value is Gaussian.

  2. \({\color{orange}\text{Correct Functional Form}}\) of Conditional Mean Structure (Linear Component)

The model assumes that the conditional means have the structure: \[\text{E}[Y|\tilde{X}= \tilde{x}] = \tilde{x}'\tilde{\beta}\]

  3. \({\color{green}\text{Homoskedasticity}}\)

The model assumes that variance \(\sigma^2\) is constant (with respect to \(\tilde{x}\)).

  4. \({\color{red}\text{Independence}}\)

The model assumes that the observations \(Y_i\) are independent of each other, given the covariates.

Direct visualization

[R code]
bw <-
  bw |>
  mutate(
    predlm2 = predict(bw_lm2)
  ) |>
  arrange(sex, age)

plot1_interact <-
  plot1 %+% bw +
  geom_line(aes(y = predlm2))

print(plot1_interact)
Figure 11: Birthweight model with interaction term

Fitted model for hers data

[R code]
hers <- fs::path_package("rme", "extdata/hersdata.dta") |> 
  haven::read_dta()
hers
Table 12: hers data
[R code]
hers_lm_with_int <- lm(
  na.action = na.exclude,
  LDL ~ smoking * age, data = hers
)


library(equatiomatic)
equatiomatic::extract_eq(hers_lm_with_int)

\[ \operatorname{LDL} = \alpha + \beta_{1}(\operatorname{smoking}) + \beta_{2}(\operatorname{age}) + \beta_{3}(\operatorname{smoking} \times \operatorname{age}) + \epsilon \]

[R code]
hers_lm_no_int <- lm(
  na.action = na.exclude,
  LDL ~ age + smoking:age - 1, data = hers
)

library(equatiomatic)
equatiomatic::extract_eq(hers_lm_no_int)

\[ \operatorname{LDL} = \beta_{1}(\operatorname{age}) + \beta_{2}(\operatorname{age} \times \operatorname{age}_{\operatorname{smoking}}) + \epsilon \]

Table 13: hers data models with and without intercepts
[R code]
library(gtsummary)
hers_lm_with_int |> 
  tbl_regression(intercept = TRUE)
(a) With intercept
Characteristic Beta 95% CI p-value
(Intercept) 154 138, 170 <0.001
current smoker 54 15, 94 0.007
age in years -0.14 -0.38, 0.09 0.2
current smoker * age in years -0.79 -1.4, -0.17 0.012
Abbreviation: CI = Confidence Interval
[R code]
hers_lm_no_int |> 
  tbl_regression(intercept = TRUE)
(b) No intercept
Characteristic Beta 95% CI p-value
age in years 2.1 2.1, 2.2 <0.001
age in years * current smoker 0.19 0.12, 0.26 <0.001
Abbreviation: CI = Confidence Interval
Figure 12: hers data models with and without intercepts
[R code]
library(sjPlot)

hers_plot1 <- hers_lm_no_int |>
  sjPlot::plot_model(
    type = "pred",
    terms = c("age", "smoking"),
    show.data = TRUE
  ) +
  facet_wrap(~group_col, ncol = 1) +
  expand_limits(y = 0) +
  theme(legend.position = "bottom")

hers_plot1
(a) No intercept

[R code]
library(sjPlot)

hers_plot2 <- hers_lm_with_int |>
  sjPlot::plot_model(
    type = "pred",
    terms = c("age", "smoking"),
    show.data = TRUE
  ) +
  facet_wrap(~group_col, ncol = 1) +
  expand_limits(y = 0) +
  theme(legend.position = "bottom")

hers_plot2
(b) With intercept

Residuals

Definition 4 (Residual noise/deviation from the population mean) The residual noise in a probabilistic model \(p(Y)\), also known as the residual deviation of an observation from its population mean or residual for short, is the difference between an observed value \(y\) and its population mean:

\[\varepsilon(y) \stackrel{\text{def}}{=}y - \text{E}{\left[Y\right]} \tag{10}\]

We can rearrange Equation 10 to view \(y\) as the sum of its mean plus the residual noise:

\[y = \text{E}{\left[Y\right]} + \varepsilon{\left(y\right)}\]

Theorem 1 (Residuals in Gaussian models) If \(Y\) has a Gaussian distribution, then \(\varepsilon(Y)\) also has a Gaussian distribution, and vice versa.

Proof. Left to the reader.

Definition 5 (Residuals of a fitted model value) The residual of a fitted value \(\hat y\) (shorthand: “residual”) is its error relative to the observed data: \[ \begin{aligned} e(\hat y) &\stackrel{\text{def}}{=}\varepsilon{\left(\hat y\right)} \\&= y - \hat y \end{aligned} \]

Example 3 (residuals in birthweight data)  

[R code]
plot1_interact +
  facet_wrap(~sex) +
  geom_segment(
    aes(
      x = age,
      y = predlm2,
      xend = age,
      yend = weight,
      col = sex,
      group = id
    ),
    linetype = "dotted"
  )
Figure 13: Fitted values and residuals for interaction model for birthweight data

Residuals of fitted values vs residual noise

\(e(\hat y)\) can be seen as the maximum likelihood estimate of the residual noise:

\[ \begin{aligned} e(\hat y) &= y - \hat y \\ &= \hat\varepsilon_{ML} \end{aligned} \]

General characteristics of residuals

Theorem 2 If \(\hat{\text{E}}{\left[Y\right]}\) is an unbiased estimator of the mean \(\text{E}{\left[Y\right]}\), then:

\[\text{E}{\left[e(y)\right]} = 0 \tag{11}\] \[\text{Var}{\left(e(y)\right)} \approx \sigma^2 \tag{12}\]

Proof.  

Equation 11:

\[ \begin{aligned} \text{E}{\left[e(y)\right]} &= \text{E}{\left[y - \hat y\right]} \\ &= \text{E}{\left[y\right]} - \text{E}{\left[\hat y\right]} \\ &= \text{E}{\left[y\right]} - \text{E}{\left[y\right]} \\ &= 0 \end{aligned} \]

Equation 12:

\[ \begin{aligned} \text{Var}{\left(e(y)\right)} &= \text{Var}{\left(y - \hat y\right)} \\ &= \text{Var}{\left(y\right)} + \text{Var}{\left(\hat y\right)} - 2 \text{Cov}{\left(y, \hat y\right)} \\ &{\dot{\approx}} \text{Var}{\left(y\right)} + 0 - 2 \cdot 0 \\ &= \text{Var}{\left(y\right)} \\ &= \sigma^2 \end{aligned} \]

Characteristics of residuals in Gaussian models

With enough data and a correct model, the residuals will be approximately Gaussian distributed, with variance \(\sigma^2\), which we can estimate using \(\hat\sigma^2\); that is:

\[ e_i \ \sim_{\text{iid}}\ N(0, \hat\sigma^2) \]

Computing residuals in R

R provides a function for residuals:

[R code]
resid(bw_lm2)
#>         1         2         3         4         5         6         7         8 
#>  176.2667 -140.7333 -144.1333  -59.5333  177.4667 -126.9333  -68.9333  242.6667 
#>         9        10        11        12        13        14        15        16 
#> -139.3333   51.6667  156.6667 -125.1333  274.2759 -137.7069  -27.6897 -246.6897 
#>        17        18        19        20        21        22        23        24 
#> -191.6724  189.3276  -11.6724 -242.6379  -47.6379  262.3621  210.3621  -30.6207

Exercise 13 Check R’s output by computing the residuals directly.

Solution.  

[R code]
bw$weight - fitted(bw_lm2)
#>         1         2         3         4         5         6         7         8 
#>  176.2667 -140.7333 -144.1333  -59.5333  177.4667 -126.9333  -68.9333  242.6667 
#>         9        10        11        12        13        14        15        16 
#> -139.3333   51.6667  156.6667 -125.1333  274.2759 -137.7069  -27.6897 -246.6897 
#>        17        18        19        20        21        22        23        24 
#> -191.6724  189.3276  -11.6724 -242.6379  -47.6379  262.3621  210.3621  -30.6207

This matches R’s output!

Graphing the residuals

Figure 14: Fitted values and residuals for interaction model for birthweight data
[R code]
plot1_interact +
  facet_wrap(~sex) +
  geom_segment(
    aes(
      x = age,
      y = predlm2,
      xend = age,
      yend = weight,
      col = sex,
      group = id
    ),
    linetype = "dotted"
  )
(a) fitted values

[R code]

bw <- bw |>
  mutate(
    resids_intxn =
      weight - fitted(bw_lm2)
  )

plot_bw_resid <-
  bw |>
  ggplot(aes(
    x = age,
    y = resids_intxn,
    linetype = sex,
    shape = sex,
    col = sex
  )) +
  theme_bw() +
  xlab("Gestational age (weeks)") +
  ylab("residuals (grams)") +
  theme(legend.position = "bottom") +
  geom_hline(aes(
    yintercept = 0,
    col = sex
  )) +
  geom_segment(
    aes(yend = 0),
    linetype = "dotted"
  ) +
  geom_point(alpha = .7)
# expand_limits(y = 0, x = 0)

print(plot_bw_resid + facet_wrap(~sex))
(b) Residuals

Residuals versus predictors

[R code]
hers <- hers |>
  mutate(
    resids_no_intcpt =
      LDL - fitted(hers_lm_no_int),
    resids_with_intcpt =
      LDL - fitted(hers_lm_with_int)
  )
Figure 15: Residuals of hers data vs predictors
[R code]
hers |>
  arrange(age) |>
  ggplot() +
  aes(x = age, y = resids_no_intcpt, col = factor(smoking)) +
  geom_point() +
  geom_hline(aes(yintercept = 0, col = factor(smoking))) +
  facet_wrap(~smoking, labeller = "label_both") +
  theme(legend.position = "bottom") +
  geom_smooth(col = "blue")
(a) no intercept

[R code]
hers |>
  arrange(age) |>
  ggplot() +
  aes(x = age, y = resids_with_intcpt, col = factor(smoking)) +
  geom_point() +
  geom_hline(aes(yintercept = 0, col = factor(smoking))) +
  facet_wrap(~smoking, labeller = "label_both") +
  theme(legend.position = "bottom") +
  geom_smooth(col = "blue")
(b) with intercept

Residuals versus fitted values

[R code]
library(ggfortify)
hers_lm_no_int |>
  update(na.action = na.omit) |>
  autoplot(
    which = 1,
    ncol = 1,
    smooth.colour = NA
  ) +
  geom_hline(yintercept = 0, col = "red")
Figure 16: Residuals of interaction model for hers data

We can add a LOESS smooth to visualize where the residual mean is nonzero:

[R code]
library(ggfortify)
hers_lm_no_int |>
  update(na.action = na.omit) |>
  autoplot(
    which = 1,
    ncol = 1
  ) +
  geom_hline(yintercept = 0, col = "red")
Figure 17: Residuals of interaction model for hers data, no intercept term
Figure 18: Residuals of interaction model for hers data, with and without intercept term
[R code]
library(ggfortify)
hers_lm_no_int |>
  update(na.action = na.omit) |>
  autoplot(
    which = 1,
    ncol = 1
  ) +
  geom_hline(yintercept = 0, col = "red")
(a) no intercept term

[R code]
hers_lm_with_int |>
  update(na.action = na.omit) |>
  autoplot(
    which = 1,
    ncol = 1
  ) +
  geom_hline(yintercept = 0, col = "red")
(b) with intercept term

Definition 6 (Standardized residuals) \[r_i = \frac{e_i}{\widehat{SD}(e_i)}\]

Hence, with enough data and a correct model, the standardized residuals will be approximately standard Gaussian; that is,

\[ r_i \ \sim_{\text{iid}}\ N(0,1) \]

Marginal distributions of residuals

To look for problems with our model, we can check whether the residuals \(e_i\) and standardized residuals \(r_i\) look like they have the distributions that they are supposed to have, according to the model.

Standardized residuals in R

[R code]
rstandard(bw_lm2)
#>          1          2          3          4          5          6          7 
#>  1.1598166 -0.9260109 -0.8747917 -0.3472255  1.0350665 -0.7347315 -0.3990086 
#>          8          9         10         11         12         13         14 
#>  1.4375164 -0.8253872  0.3060646  0.9280669 -0.8761592  1.9142780 -0.8655921 
#>         15         16         17         18         19         20         21 
#> -0.1642993 -1.4637574 -1.1101599  1.0965787 -0.0676062 -1.4615865 -0.2869582 
#>         22         23         24 
#>  1.5803994  1.2671652 -0.1980543
resid(bw_lm2) / sigma(bw_lm2)
#>          1          2          3          4          5          6          7 
#>  0.9759331 -0.7791962 -0.7980209 -0.3296173  0.9825771 -0.7027900 -0.3816622 
#>          8          9         10         11         12         13         14 
#>  1.3435690 -0.7714449  0.2860621  0.8674141 -0.6928239  1.5185792 -0.7624398 
#>         15         16         17         18         19         20         21 
#> -0.1533089 -1.3658431 -1.0612299  1.0482473 -0.0646265 -1.3434099 -0.2637562 
#>         22         23         24 
#>  1.4526163  1.1647086 -0.1695371
[R code]
rstandard_compare_plot <-
  tibble(
    x = resid(bw_lm2) / sigma(bw_lm2),
    y = rstandard(bw_lm2)
  ) |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  theme_bw() +
  coord_equal() +
  xlab("resid(bw_lm2)/sigma(bw_lm2)") +
  ylab("rstandard(bw_lm2)") +
  geom_abline(
    aes(
      intercept = 0,
      slope = 1,
      col = "x=y"
    )
  ) +
  labs(colour = "") +
  scale_colour_manual(values = "red")

print(rstandard_compare_plot)
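
The two versions differ because `rstandard()` estimates \(\widehat{SD}(e_i)\) with a leverage adjustment: \(\widehat{SD}(e_i) = \hat\sigma\sqrt{1 - h_{ii}}\), where \(h_{ii}\) is the leverage of observation \(i\). A sketch of the computation (assuming the `bw_lm2` fit from earlier):

[R code]
h <- hatvalues(bw_lm2) # leverage of each observation
r_by_hand <- resid(bw_lm2) / (sigma(bw_lm2) * sqrt(1 - h))

all.equal(r_by_hand, rstandard(bw_lm2)) # should be TRUE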

Let’s add these residuals to the tibble of our dataset:

[R code]
bw <-
  bw |>
  mutate(
    fitted_lm2 = fitted(bw_lm2),
    resid_lm2 = resid(bw_lm2),
    resid_lm2_alt = weight - fitted_lm2,
    std_resid_lm2 = rstandard(bw_lm2),
    std_resid_lm2_alt = resid_lm2 / sigma(bw_lm2)
  )

bw |>
  select(
    sex,
    age,
    weight,
    fitted_lm2,
    resid_lm2,
    std_resid_lm2
  )
[R code]
resid_marginal_hist <-
  bw |>
  ggplot(aes(x = resid_lm2)) +
  geom_histogram()

print(resid_marginal_hist)
Figure 19: Marginal distribution of (nonstandardized) residuals
[R code]
std_resid_marginal_hist <-
  bw |>
  ggplot(aes(x = std_resid_lm2)) +
  geom_histogram()

print(std_resid_marginal_hist)
Figure 20: Marginal distribution of standardized residuals

QQ plot of standardized residuals

[R code]
library(ggfortify)
# needed to make ggplot2::autoplot() work for `lm` objects

qqplot_lm2_auto <-
  bw_lm2 |>
  autoplot(
    which = 2, # options are 1:6; can do multiple at once
    ncol = 1
  ) +
  theme_classic()

print(qqplot_lm2_auto)

QQ plot - how it’s built

[R code]
bw <- bw |>
  mutate(
    p = (rank(std_resid_lm2) - 1 / 2) / n(), # Hazen's plotting positions
    expected_quantiles_lm2 = qnorm(p)
  )

qqplot_lm2 <-
  bw |>
  ggplot(
    aes(
      x = expected_quantiles_lm2,
      y = std_resid_lm2,
      col = sex,
      shape = sex
    )
  ) +
  geom_point() +
  theme_classic() +
  theme(legend.position = "none") + # removing the plot legend
  ggtitle("Normal Q-Q") +
  xlab("Theoretical Quantiles") +
  ylab("Standardized residuals")

# find the expected line:

ps <- c(.25, .75) # reference probabilities
a <- quantile(rstandard(bw_lm2), ps) # empirical quantiles
b <- qnorm(ps) # theoretical quantiles

qq_slope <- diff(a) / diff(b)
qq_intcpt <- a[1] - b[1] * qq_slope

qqplot_lm2 <-
  qqplot_lm2 +
  geom_abline(slope = qq_slope, intercept = qq_intcpt)

print(qqplot_lm2)

Conditional distributions of residuals

If our Gaussian linear regression model is correct, the residuals \(e_i\) and standardized residuals \(r_i\) should have:

  • an approximately Gaussian distribution, with:
  • a mean of 0
  • a constant variance

This should be true for every value of \(x\).

If we didn’t correctly guess the functional form of the linear component of the mean, \[\text{E}[Y|X=x] = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p\]

Then the residuals might have nonzero mean.

Regardless of whether we guessed the mean function correctly, ther the variance of the residuals might differ between values of \(x\).

Residuals versus fitted values

[R code]
autoplot(bw_lm2, which = 1, ncol = 1) |> print()
Figure 21: birthweight model (Equation 2): residuals versus fitted values

Example: PLOS Medicine title length data

(Adapted from Dobson and Barnett (2018), §6.7.1)

[R code]
data(PLOS, package = "dobson")
library(ggplot2)
fig1 = 
  PLOS |> 
  ggplot(
    aes(x = authors,
        y = nchar)
  ) +
  geom_point() +
  theme(legend.position = "bottom") +
  labs(col = "") +
  guides(col=guide_legend(ncol=3))
fig1
Figure 22: Number of authors versus title length in PLOS Medicine articles

Linear fit

[R code]
lm_PLOS_linear = lm(
  formula = nchar ~ authors, 
  data = PLOS)
[R code]
fig2 = fig1 +
  geom_smooth(
    method = "lm", 
              fullrange = TRUE,
              aes(col = "lm(y ~ x)"))
fig2

library(ggfortify)
autoplot(lm_PLOS_linear, which = 1, ncol = 1)
Figure 23: Number of authors versus title length in PLOS Medicine, with linear model fit
(a) Data and fit
(b) Residuals vs fitted

Quadratic fit

[R code]
lm_PLOS_quad = lm(
  formula = nchar ~ authors + I(authors^2), 
  data = PLOS)
[R code]
fig3 = 
  fig2 + 
geom_smooth(
    method = "lm",
    fullrange = TRUE,
    formula = y ~ x + I(x ^ 2),
    aes(col = "lm(y ~ x + I(x^2))")
  )
fig3

autoplot(lm_PLOS_quad, which = 1, ncol = 1)
Figure 24: Number of authors versus title length in PLOS Medicine, with quadratic model fit
(a) Data and fit
(b) Residuals vs fitted

Linear versus quadratic fits

[R code]
library(ggfortify)
autoplot(lm_PLOS_linear, which = 1, ncol = 1)

autoplot(lm_PLOS_quad, which = 1, ncol = 1)
Figure 25: Residuals versus fitted plot for linear and quadratic fits to PLOS data
(a) Linear
(b) Quadratic

Cubic fit

[R code]
lm_PLOS_cub = lm(
  formula = nchar ~ authors + I(authors^2) + I(authors^3), 
  data = PLOS)
[R code]
fig4 = 
  fig3 + 
geom_smooth(
    method = "lm",
    fullrange = TRUE,
    formula = y ~ x + I(x ^ 2) + I(x ^ 3),
    aes(col = "lm(y ~ x + I(x^2) + I(x ^ 3))")
  )
fig4

autoplot(lm_PLOS_cub, which = 1, ncol = 1)
Figure 26: Number of authors versus title length in PLOS Medicine, with cubic model fit
(a) Data and fit
(b) Residuals vs fitted

Logarithmic fit

[R code]
lm_PLOS_log = lm(nchar ~ log(authors), data = PLOS)
[R code]
fig5 = fig4 + 
  geom_smooth(
    method = "lm",
    fullrange = TRUE,
    formula = y ~ log(x),
    aes(col = "lm(y ~ log(x))")
  )
fig5

autoplot(lm_PLOS_log, which = 1, ncol = 1)
Figure 27: logarithmic fit
(a) Data and fit
(b) Residuals vs fitted

Model selection

[R code]
anova(lm_PLOS_linear, lm_PLOS_quad)
Table 14: linear vs quadratic
[R code]
anova(lm_PLOS_quad, lm_PLOS_cub)
Table 15: quadratic vs cubic

AIC/BIC

[R code]
AIC(lm_PLOS_quad)
#> [1] 8567.61
AIC(lm_PLOS_cub)
#> [1] 8554.51
[R code]
AIC(lm_PLOS_cub)
#> [1] 8554.51
AIC(lm_PLOS_log)
#> [1] 8543.63
[R code]
BIC(lm_PLOS_cub)
#> [1] 8578.4
BIC(lm_PLOS_log)
#> [1] 8557.97

Extrapolation is dangerous

[R code]
fig_all = fig5 +
  xlim(0, 60)
fig_all
Figure 28: Number of authors versus title length in PLOS Medicine

Scale-location plot

[R code]
autoplot(bw_lm2, which = 3, ncol = 1) |> print()
Figure 29: Scale-location plot of birthweight data

Residuals versus leverage

[R code]
autoplot(bw_lm2, which = 5, ncol = 1) |> print()
Figure 30: birthweight model with interactions (Equation 2): residuals versus leverage

Diagnostics constructed by hand

[R code]
bw <-
  bw |>
  mutate(
    predlm2 = predict(bw_lm2),
    residlm2 = weight - predlm2,
    std_resid = residlm2 / sigma(bw_lm2),
    # std_resid_builtin = rstandard(bw_lm2), # uses leverage
    sqrt_abs_std_resid = std_resid |> abs() |> sqrt()
  )
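
The commented-out `rstandard()` line differs from `std_resid` above because the built-in version also divides by \(\sqrt{1 - h_{ii}}\), where \(h_{ii}\) is the leverage of observation \(i\). A sketch of that relationship (assuming `bw_lm2` from earlier):

```r
# Internally studentized residuals: r_i = e_i / (sigma_hat * sqrt(1 - h_ii))
h <- hatvalues(bw_lm2)
r_manual <- resid(bw_lm2) / (sigma(bw_lm2) * sqrt(1 - h))
all.equal(unname(r_manual), unname(rstandard(bw_lm2)))  # should be TRUE
```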

Residuals vs fitted

[R code]
resid_vs_fit <- bw |>
  ggplot(
    aes(x = predlm2, y = residlm2, col = sex, shape = sex)
  ) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

print(resid_vs_fit)

Standardized residuals vs fitted

[R code]
bw |>
  ggplot(
    aes(x = predlm2, y = std_resid, col = sex, shape = sex)
  ) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

Standardized residuals vs gestational age

[R code]
bw |>
  ggplot(
    aes(x = age, y = std_resid, col = sex, shape = sex)
  ) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

sqrt(abs(rstandard())) vs fitted

Compare with autoplot(bw_lm2, 3)

[R code]
bw |>
  ggplot(
    aes(x = predlm2, y = sqrt_abs_std_resid, col = sex, shape = sex)
  ) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

1.9 Model selection

(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).

Mean squared error

We might want to minimize the mean squared error, \(\text E[(y-\hat y)^2]\), for new observations that weren’t in our data set when we fit the model.

Unfortunately, \[\frac{1}{n}\sum_{i=1}^n (y_i-\hat y_i)^2\] gives a biased estimate of \(\text E[(y-\hat y)^2]\) for new data. If we want an unbiased estimate, we will have to be clever.

Cross-validation
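
One such clever approach is \(K\)-fold cross-validation: partition the data into \(K\) folds, refit the model \(K\) times leaving one fold out each time, and score each observation using the fit that never saw it:

\[\widehat{\text{MSE}}_{\text{CV}} = \frac{1}{n}\sum_{k=1}^{K}\sum_{i \in F_k}\left(y_i-\hat y_i^{(-k)}\right)^2\]

where \(F_k\) is the set of observations in fold \(k\) and \(\hat y_i^{(-k)}\) is the prediction for observation \(i\) from the model fit without fold \(k\). The `cvTools` code below repeats this procedure \(R = 10\) times, so we can also see the spread of the estimates.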

[R code]
data("carbohydrate", package = "dobson")
library(cvTools)
full_model <- lm(carbohydrate ~ ., data = carbohydrate)
cv_full <-
  full_model |> cvFit(
    data = carbohydrate, K = 5, R = 10,
    y = carbohydrate$carbohydrate
  )

reduced_model <- full_model |> update(formula = ~ . - age)

cv_reduced <-
  reduced_model |> cvFit(
    data = carbohydrate, K = 5, R = 10,
    y = carbohydrate$carbohydrate
  )
[R code]
results_reduced <-
  tibble(
    model = "wgt+protein",
    errs = cv_reduced$reps[]
  )
results_full <-
  tibble(
    model = "wgt+age+protein",
    errs = cv_full$reps[]
  )

cv_results <-
  bind_rows(results_reduced, results_full)

cv_results |>
  ggplot(aes(y = model, x = errs)) +
  geom_boxplot()

Comparing metrics

[R code]
compare_results <- tribble(
  ~model, ~cvRMSE, ~r.squared, ~adj.r.squared, ~trainRMSE, ~loglik,
  "full",
  cv_full$cv,
  summary(full_model)$r.squared,
  summary(full_model)$adj.r.squared,
  sigma(full_model),
  logLik(full_model) |> as.numeric(),
  "reduced",
  cv_reduced$cv,
  summary(reduced_model)$r.squared,
  summary(reduced_model)$adj.r.squared,
  sigma(reduced_model),
  logLik(reduced_model) |> as.numeric()
)

compare_results
[R code]
anova(full_model, reduced_model)

Stepwise regression

[R code]
library(olsrr)
olsrr::ols_step_both_aic(full_model)
#> 
#> 
#>                              Stepwise Summary                              
#> -------------------------------------------------------------------------
#> Step    Variable         AIC        SBC       SBIC       R2       Adj. R2 
#> -------------------------------------------------------------------------
#>  0      Base Model     140.773    142.764    83.068    0.00000    0.00000 
#>  1      protein (+)    137.950    140.937    80.438    0.21427    0.17061 
#>  2      weight (+)     132.981    136.964    77.191    0.44544    0.38020 
#> -------------------------------------------------------------------------
#> 
#> Final Model Output 
#> ------------------
#> 
#>                          Model Summary                          
#> ---------------------------------------------------------------
#> R                       0.667       RMSE                 5.505 
#> R-Squared               0.445       MSE                 30.301 
#> Adj. R-Squared          0.380       Coef. Var           15.879 
#> Pred R-Squared          0.236       AIC                132.981 
#> MAE                     4.593       SBC                136.964 
#> ---------------------------------------------------------------
#>  RMSE: Root Mean Square Error 
#>  MSE: Mean Square Error 
#>  MAE: Mean Absolute Error 
#>  AIC: Akaike Information Criteria 
#>  SBC: Schwarz Bayesian Criteria 
#> 
#>                                ANOVA                                
#> -------------------------------------------------------------------
#>                 Sum of                                             
#>                Squares        DF    Mean Square      F        Sig. 
#> -------------------------------------------------------------------
#> Regression     486.778         2        243.389    6.827    0.0067 
#> Residual       606.022        17         35.648                    
#> Total         1092.800        19                                   
#> -------------------------------------------------------------------
#> 
#>                                   Parameter Estimates                                    
#> ----------------------------------------------------------------------------------------
#>       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
#> ----------------------------------------------------------------------------------------
#> (Intercept)    33.130        12.572                  2.635    0.017     6.607    59.654 
#>     protein     1.824         0.623        0.534     2.927    0.009     0.509     3.139 
#>      weight    -0.222         0.083       -0.486    -2.662    0.016    -0.397    -0.046 
#> ----------------------------------------------------------------------------------------

Lasso

\[\arg \max_{\theta} {\left\{\ell(\theta) - \lambda \sum_{j=1}^p|\beta_j|\right\}}\]
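
For the Gaussian linear model, maximizing this penalized log-likelihood is equivalent (up to constants) to penalized least squares; in the lasso case this is the objective `glmnet` minimizes for `family = "gaussian"`:

\[\hat\beta_{\text{lasso}} = \arg \min_{\beta} {\left\{\frac{1}{2n}\sum_{i=1}^n \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^p|\beta_j|\right\}}\]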

[R code]
library(glmnet)
y <- carbohydrate$carbohydrate
x <- carbohydrate |>
  select(age, weight, protein) |>
  as.matrix()
fit <- glmnet(x, y)
[R code]
autoplot(fit, xvar = "lambda")
Figure 31: Lasso selection
[R code]
cvfit <- cv.glmnet(x, y)
plot(cvfit)

[R code]
coef(cvfit, s = "lambda.1se")
#> 4 x 1 sparse Matrix of class "dgCMatrix"
#>             lambda.1se
#> (Intercept) 34.3091615
#> age          .        
#> weight      -0.0800142
#> protein      0.7640510
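
Once \(\lambda\) is chosen, predictions come from `predict()` on the `cv.glmnet` object. A sketch, assuming `cvfit` and `x` from the chunks above (`lambda.min` gives the smallest cross-validated error; `lambda.1se` is the most penalized model within one standard error of it):

```r
# Fitted values at the two conventional lambda choices
pred_1se <- predict(cvfit, newx = x, s = "lambda.1se")
pred_min <- predict(cvfit, newx = x, s = "lambda.min")
head(cbind(pred_1se, pred_min))
```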

1.10 Categorical covariates with more than two levels

Example: birthweight

In the birthweight example, the variable sex had only two observed values:

[R code]
unique(bw$sex)
#> [1] female male  
#> Levels: female male

If there are more than two observed values, we can’t just use a single variable with 0s and 1s.

[R code]
head(iris)
Table 16: The iris data
[R code]
library(table1)
table1(
  x = ~ . | Species,
  data = iris,
  overall = FALSE
)
Table 17: Summary statistics for the iris data
|                       | setosa (N=50)        | versicolor (N=50)   | virginica (N=50)    |
|-----------------------|----------------------|---------------------|---------------------|
| **Sepal.Length**      |                      |                     |                     |
| Mean (SD)             | 5.01 (0.352)         | 5.94 (0.516)        | 6.59 (0.636)        |
| Median [Min, Max]     | 5.00 [4.30, 5.80]    | 5.90 [4.90, 7.00]   | 6.50 [4.90, 7.90]   |
| **Sepal.Width**       |                      |                     |                     |
| Mean (SD)             | 3.43 (0.379)         | 2.77 (0.314)        | 2.97 (0.322)        |
| Median [Min, Max]     | 3.40 [2.30, 4.40]    | 2.80 [2.00, 3.40]   | 3.00 [2.20, 3.80]   |
| **Petal.Length**      |                      |                     |                     |
| Mean (SD)             | 1.46 (0.174)         | 4.26 (0.470)        | 5.55 (0.552)        |
| Median [Min, Max]     | 1.50 [1.00, 1.90]    | 4.35 [3.00, 5.10]   | 5.55 [4.50, 6.90]   |
| **Petal.Width**       |                      |                     |                     |
| Mean (SD)             | 0.246 (0.105)        | 1.33 (0.198)        | 2.03 (0.275)        |
| Median [Min, Max]     | 0.200 [0.100, 0.600] | 1.30 [1.00, 1.80]   | 2.00 [1.40, 2.50]   |

If we want to model Sepal.Length by species, we could create a variable \(X\) that represents “setosa” as \(X=1\), “virginica” as \(X=2\), and “versicolor” as \(X=3\).

[R code]
data(iris) # this step is not always necessary, but ensures you're starting
# from the original version of a dataset stored in a loaded package

iris <-
  iris |>
  tibble() |>
  mutate(
    X = case_when(
      Species == "setosa" ~ 1,
      Species == "virginica" ~ 2,
      Species == "versicolor" ~ 3
    )
  )

iris |>
  distinct(Species, X)
Table 18: iris data with numeric coding of species

Then we could fit a model like:

[R code]
iris_lm1 <- lm(Sepal.Length ~ X, data = iris)
iris_lm1 |>
  parameters() |>
  print_md()
Table 19: Model of iris data with numeric coding of Species
| Parameter   | Coefficient | SE   | 95% CI       | t(148) | p      |
|-------------|-------------|------|--------------|--------|--------|
| (Intercept) | 4.91        | 0.16 | (4.60, 5.23) | 30.83  | < .001 |
| X           | 0.46        | 0.07 | (0.32, 0.61) | 6.30   | < .001 |

Let’s see how that model looks:

[R code]
iris_plot1 <- iris |>
  ggplot(
    aes(
      x = X,
      y = Sepal.Length
    )
  ) +
  geom_point(alpha = .1) +
  geom_abline(
    intercept = coef(iris_lm1)[1],
    slope = coef(iris_lm1)[2]
  ) +
  theme_bw(base_size = 18)
print(iris_plot1)
Figure 32: Model of iris data with numeric coding of Species

We have forced the model to use a straight line for the three estimated means. Maybe not a good idea?

Let’s see what R does with categorical variables by default:

[R code]
iris_lm2 <- lm(Sepal.Length ~ Species, data = iris)
iris_lm2 |>
  parameters() |>
  print_md()
Table 20: Model of iris data with Species as a categorical variable
| Parameter            | Coefficient | SE   | 95% CI       | t(147) | p      |
|----------------------|-------------|------|--------------|--------|--------|
| (Intercept)          | 5.01        | 0.07 | (4.86, 5.15) | 68.76  | < .001 |
| Species (versicolor) | 0.93        | 0.10 | (0.73, 1.13) | 9.03   | < .001 |
| Species (virginica)  | 1.58        | 0.10 | (1.38, 1.79) | 15.37  | < .001 |

Re-parametrize with no intercept

If you don’t want the default reference-level-plus-offsets parametrization, you can remove the intercept with “-1”, as we’ve seen previously:

[R code]
iris_lm2_no_int <- lm(Sepal.Length ~ Species - 1, data = iris)
iris_lm2_no_int |>
  parameters() |>
  print_md()
Table 21
| Parameter            | Coefficient | SE   | 95% CI       | t(147) | p      |
|----------------------|-------------|------|--------------|--------|--------|
| Species (setosa)     | 5.01        | 0.07 | (4.86, 5.15) | 68.76  | < .001 |
| Species (versicolor) | 5.94        | 0.07 | (5.79, 6.08) | 81.54  | < .001 |
| Species (virginica)  | 6.59        | 0.07 | (6.44, 6.73) | 90.49  | < .001 |

Let’s see what these new models look like:

[R code]
iris_plot2 <-
  iris |>
  mutate(
    predlm2 = predict(iris_lm2)
  ) |>
  arrange(X) |>
  ggplot(aes(x = X, y = Sepal.Length)) +
  geom_point(alpha = .1) +
  geom_line(aes(y = predlm2), col = "red") +
  geom_abline(
    intercept = coef(iris_lm1)[1],
    slope = coef(iris_lm1)[2]
  ) +
  theme_bw(base_size = 18)

print(iris_plot2)
Figure 33

Let’s see how R did that:

[R code]
formula(iris_lm2)
#> Sepal.Length ~ Species
model.matrix(iris_lm2) |>
  as_tibble() |>
  unique()
Table 22

This format is called a “corner point parametrization” (e.g., in Dobson and Barnett (2018)) or “treatment coding” (e.g., in Dunn and Smyth (2018)).

The default contrasts are controlled by options("contrasts"):

[R code]
options("contrasts")
#> $contrasts
#>         unordered           ordered 
#> "contr.treatment"      "contr.poly"

See ?options for more details.

[R code]
formula(iris_lm2_no_int)
#> Sepal.Length ~ Species - 1
model.matrix(iris_lm2_no_int) |>
  as_tibble() |>
  unique()
Table 23

This format is called a “group point parametrization” (e.g., in Dobson and Barnett (2018)).

There are more options; see Dobson and Barnett (2018) §6.4.1 and the codingMatrices package vignette (Venables (2023)).
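
For example, sum-to-zero (“effect”) coding can be requested for a single fit via the `contrasts` argument; the coefficients then estimate deviations from the grand mean rather than from a reference level. A sketch, assuming the `iris` data from above:

```r
# Sum-to-zero ("effect") coding for Species
iris_lm2_sum <- lm(
  Sepal.Length ~ Species,
  data = iris,
  contrasts = list(Species = contr.sum)
)
coef(iris_lm2_sum)
```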

1.11 Ordinal covariates

(c.f. Dobson and Barnett (2018) §2.4.4)

Example 4  

[R code]
url <- paste0(
  "https://regression.ucsf.edu/sites/g/files/tkssra6706/",
  "f/wysiwyg/home/data/hersdata.dta"
)
library(haven)
hers <- read_dta(url)
[R code]
hers |> head()
Table 24: HERS dataset

Check out ?codingMatrices::contr.diff
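
As a related sketch for ordinal covariates, `MASS::contr.sdif` gives successive-differences coding, in which each coefficient estimates the difference in means between adjacent levels (the `iris` data is used here only for illustration; `Species` is not truly ordinal):

```r
# Successive-differences coding: each coefficient is the difference
# in mean Sepal.Length between adjacent factor levels
library(MASS)
lm(
  Sepal.Length ~ Species,
  data = iris,
  contrasts = list(Species = contr.sdif)
) |> coef()
```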

Anderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.” Bulletin of American Iris Society 59: 2–5.
Chatterjee, Samprit, and Ali S Hadi. 2015. Regression Analysis by Example. John Wiley & Sons. https://www.wiley.com/en-us/Regression+Analysis+by+Example%2C+4th+Edition-p-9780470055458.
Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.
Dunn, Peter K, and Gordon K Smyth. 2018. Generalized Linear Models with Examples in R. Vol. 53. Springer. https://link.springer.com/book/10.1007/978-1-4419-0118-7.
Faraway, Julian J. 2025. Linear Models with R. https://www.routledge.com/Linear-Models-with-R/Faraway/p/book/9781032583983.
Harrell, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2nd ed. Springer. https://doi.org/10.1007/978-3-319-19425-7.
Hogg, Robert V., Elliot A. Tanis, and Dale L. Zimmerman. 2015. Probability and Statistical Inference. Ninth edition. Boston: Pearson.
Hulley, Stephen, Deborah Grady, Trudy Bush, Curt Furberg, David Herrington, Betty Riggs, Eric Vittinghoff, for the Heart, and Estrogen/progestin Replacement Study (HERS) Research Group. 1998. “Randomized Trial of Estrogen Plus Progestin for Secondary Prevention of Coronary Heart Disease in Postmenopausal Women.” JAMA : The Journal of the American Medical Association 280 (7): 605–13.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer. https://www.statlearning.com/.
Kleinbaum, David G, and Mitchel Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-1742-3.
———. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.
Kleinbaum, David G, Lawrence L Kupper, Azhar Nizam, K Muller, and ES Rosenberg. 2014. Applied Regression Analysis and Other Multivariable Methods. 5th ed. Cengage Learning. https://www.cengage.com/c/applied-regression-analysis-and-other-multivariable-methods-5e-kleinbaum/9781285051086/.
Kutner, Michael H, Christopher J Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models. McGraw-Hill.
Polin, Richard A, William W Fox, and Steven H Abman. 2011. Fetal and Neonatal Physiology. 4th ed. Elsevier health sciences.
Seber, George AF, and Alan J Lee. 2012. Linear Regression Analysis. 2nd ed. John Wiley & Sons. https://www.wiley.com/en-us/Linear+Regression+Analysis%2C+2nd+Edition-p-9781118274422.
Venables, Bill. 2023. codingMatrices: Alternative Factor Coding Matrices for Linear Model Formulae (version 0.4.0). https://CRAN.R-project.org/package=codingMatrices.
Vittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. Springer. https://doi.org/10.1007/978-1-4614-1353-0.
Weisberg, Sanford. 2005. Applied Linear Regression. Vol. 528. John Wiley & Sons.